Use if condition in Python Pandas
I have a DataFrame and I am trying to apply an if condition to the data: if column A is 'ON', then column E should be col C + col D; otherwise column E should be MAX(col B, col C) - col C + col D.
df1:
T_ID A B C D
1 ON 100 90 0
2 OFF 150 120 -20
3 OFF 200 150 0
4 ON 400 320 0
5 ON 100 60 -10
6 ON 250 200 0
Resulting DataFrame:
T_ID A B C D E
1 ON 100 90 0 90
2 OFF 150 120 -20 10
3 OFF 200 150 0 50
4 ON 400 320 0 320
5 ON 100 60 -10 50
6 ON 250 200 0 200
I am using the code below. Is there any suggestion for a better way to do this?
condition = df1['A'].eq('ON')
df1['E'] = np.where(condition, df1['C'] + df1['D'], max(df1['B'],df1['C'])-df1['C']+df1['D'])
I think np.where is a good approach here. numpy.maximum works for me, whereas the built-in max raises an error:
import numpy as np

# np.maximum takes the element-wise maximum of the two columns
condition = df1['A'].eq('ON')
df1['E'] = np.where(condition,
                    df1['C'] + df1['D'],
                    np.maximum(df1['B'], df1['C']) - df1['C'] + df1['D'])
print(df1)
T_ID A B C D E
0 1 ON 100 90 0 90
1 2 OFF 150 120 -20 10
2 3 OFF 200 150 0 50
3 4 ON 400 320 0 320
4 5 ON 100 60 -10 50
5 6 ON 250 200 0 200
# The same call with the built-in max fails
df1['E'] = np.where(condition,
                    df1['C'] + df1['D'],
                    max(df1['B'], df1['C']) - df1['C'] + df1['D'])
print(df1)
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
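The error comes from the built-in max, which compares its arguments and then needs a single True/False answer, while comparing two Series produces a whole Series of booleans. A minimal sketch of the difference, assuming two small hypothetical Series b and c (the values are made up for illustration and are not the question's data):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for two numeric columns such as df1['B'] and df1['C'].
b = pd.Series([100, 50, 200])
c = pd.Series([90, 120, 150])

# Built-in max needs one boolean from comparing the two Series, which pandas refuses to give.
try:
    max(b, c)
    builtin_max_failed = False
except ValueError:
    builtin_max_failed = True

# np.maximum compares element by element and returns a Series of the same length.
elementwise = np.maximum(b, c)

print(builtin_max_failed)
print(elementwise.tolist())
```

This is why swapping max for np.maximum is enough to make the original np.where call work.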
apply is a worse solution here, because it loops under the hood and is very slow:
# 6k rows -> for the sample data, np.where is about 265 times faster than apply
df1 = pd.concat([df1] * 1000, ignore_index=True)
print (df1)
In [73]: %%timeit
...: condition = df1['A'].eq('ON')
...:
...: df1['E1'] = np.where(condition,
...:                      df1['C'] + df1['D'],
...:                      np.maximum(df1['B'], df1['C']) - df1['C'] + df1['D'])
...:
1.91 ms ± 11.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [74]: %%timeit
...: df1['E2'] = df1.apply(createE, axis=1)
...:
507 ms ± 11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
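Before comparing speed, it helps to confirm that both versions return the same values. A minimal sketch, assuming the sample data from the question and the row-wise createE function from the apply answer (the column names E1 and E2 follow the timing session above):

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({
    'T_ID': [1, 2, 3, 4, 5, 6],
    'A': ['ON', 'OFF', 'OFF', 'ON', 'ON', 'ON'],
    'B': [100, 150, 200, 400, 100, 250],
    'C': [90, 120, 150, 320, 60, 200],
    'D': [0, -20, 0, 0, -10, 0],
})

def createE(row):
    # Row-wise version of the rule, called once per row by apply.
    if row.A == 'ON':
        return row.C + row.D
    return max(row.B, row.C) - row.C + row.D

# Vectorized version: operates on whole columns at once.
condition = df1['A'].eq('ON')
df1['E1'] = np.where(condition,
                     df1['C'] + df1['D'],
                     np.maximum(df1['B'], df1['C']) - df1['C'] + df1['D'])

# Row-wise version: axis=1 passes each row to createE.
df1['E2'] = df1.apply(createE, axis=1)

# Check that the two columns agree on every row.
same = bool((df1['E1'] == df1['E2']).all())
print(same)
```

If the check passes, the choice between the two comes down to speed and readability rather than correctness.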
I think the apply function would be a better solution.
The code might look like this:
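def createE(row):
    # Return E for a single row
    if row.A == 'ON':
        return row.C + row.D
    else:
        return max(row.B, row.C) - row.C + row.D

# axis=1 is required so that createE receives rows, not columns
df1['E'] = df1.apply(createE, axis=1)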
See https://www.geeksforgeeks.org/create-a-new-column-in-pandas-dataframe-based-on-the-existing-columns/ for more about apply.
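A row-wise function such as createE only receives whole rows when apply is called with axis=1; by default apply hands the function one column at a time. A minimal sketch of the two modes, assuming a small hypothetical frame and using len as the applied function so the difference is visible:

```python
import pandas as pd

# Hypothetical 3-row, 2-column frame, chosen so the two axes give different lengths.
df = pd.DataFrame({'B': [100, 150, 200], 'C': [90, 120, 150]})

# Default (axis=0): len is called on each column, so each result is the row count.
per_column = df.apply(len)

# axis=1: len is called on each row, so each result is the column count.
per_row = df.apply(len, axis=1)

print(per_column.to_dict())
print(per_row.tolist())
```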