Use if condition in Python Pandas

Question

I have a dataframe and I am trying to apply an if condition to the data: if column A is 'ON', then column E should be Col C + Col D; otherwise column E should be MAX(Col B, Col C) - Col C + Col D.

df1:

T_ID   A     B    C     D
1      ON   100   90    0
2      OFF  150   120  -20
3      OFF  200   150   0
4      ON   400   320   0
5      ON   100    60  -10
6      ON   250   200   0

Resulting dataframe:

T_ID   A     B    C     D    E
1      ON   100   90    0     90
2      OFF  150   120  -20    10
3      OFF  200   150   0     50
4      ON   400   320   0    320
5      ON   100    60  -10    50
6      ON   250   200   0    200

I am using the following code. Any suggestions for a better way to do this?

condition = df1['A'].eq('ON')

df1['E'] = np.where(condition, df1['C'] + df1['D'], max(df1['B'],df1['C'])-df1['C']+df1['D'])

Answer 1

I think np.where is a good approach here. For me numpy.maximum works, while the built-in max raises an error:

condition = df1['A'].eq('ON')

df1['E'] = np.where(condition, 
                    df1['C'] + df1['D'], 
                    np.maximum(df1['B'],df1['C'])-df1['C']+df1['D'])
print (df1)
   T_ID    A    B    C   D    E
0     1   ON  100   90   0   90
1     2  OFF  150  120 -20   10
2     3  OFF  200  150   0   50
3     4   ON  400  320   0  320
4     5   ON  100   60 -10   50
5     6   ON  250  200   0  200

df1['E'] = np.where(condition, 
                    df1['C'] + df1['D'], 
                    max(df1['B'],df1['C'])-df1['C']+df1['D'])
print (df1)

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
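The error comes from the built-in max, which compares its two arguments and needs a single True/False from that comparison; comparing two Series gives a Series of booleans instead. A minimal self-contained sketch of the difference, rebuilding the sample data from the question (the names builtin_max_failed and elementwise_max are only for illustration):

```python
import numpy as np
import pandas as pd

# Sample data from the question
df1 = pd.DataFrame({
    "T_ID": [1, 2, 3, 4, 5, 6],
    "A": ["ON", "OFF", "OFF", "ON", "ON", "ON"],
    "B": [100, 150, 200, 400, 100, 250],
    "C": [90, 120, 150, 320, 60, 200],
    "D": [0, -20, 0, 0, -10, 0],
})

# Built-in max needs one boolean from the comparison but gets a whole Series.
try:
    max(df1["B"], df1["C"])
    builtin_max_failed = False
except ValueError:
    builtin_max_failed = True

# np.maximum compares the two columns element by element.
elementwise_max = np.maximum(df1["B"], df1["C"])

condition = df1["A"].eq("ON")
df1["E"] = np.where(condition,
                    df1["C"] + df1["D"],
                    elementwise_max - df1["C"] + df1["D"])
```

Here elementwise_max holds the larger of B and C for each row, so the second branch of np.where is computed for the whole column at once.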

apply is the worse solution here, because it loops over the rows under the hood, which is very slow:

# 6k rows -> for the sample data np.where is about 265 times faster than apply
df1 = pd.concat([df1] * 1000, ignore_index=True)
print (df1)


In [73]: %%timeit
    ...: condition = df1['A'].eq('ON')
    ...: 
    ...: df1['E1'] = np.where(condition, 
    ...:                     df1['C'] + df1['D'], 
    ...:                     np.maximum(df1['B'],df1['C'])-df1['C']+df1['D'])
    ...:                     
1.91 ms ± 11.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [74]: %%timeit
    ...: df1['E2'] = df1.apply(createE, axis=1)
    ...: 
507 ms ± 11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
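The comparison can also be reproduced outside IPython. This is only a rough sketch using the standard timeit module; the helper names create_e, with_where and with_apply are made up for the example, and the actual timings will depend on the machine and library versions:

```python
import timeit

import numpy as np
import pandas as pd

# Sample data from the question, repeated 1000 times -> 6000 rows
sample = pd.DataFrame({
    "T_ID": [1, 2, 3, 4, 5, 6],
    "A": ["ON", "OFF", "OFF", "ON", "ON", "ON"],
    "B": [100, 150, 200, 400, 100, 250],
    "C": [90, 120, 150, 320, 60, 200],
    "D": [0, -20, 0, 0, -10, 0],
})
big = pd.concat([sample] * 1000, ignore_index=True)


def create_e(row):
    # Row-wise rule: ON -> C + D, otherwise max(B, C) - C + D
    if row["A"] == "ON":
        return row["C"] + row["D"]
    return max(row["B"], row["C"]) - row["C"] + row["D"]


def with_where():
    # Vectorized version: whole columns at once
    condition = big["A"].eq("ON")
    return np.where(condition,
                    big["C"] + big["D"],
                    np.maximum(big["B"], big["C"]) - big["C"] + big["D"])


def with_apply():
    # Row-by-row version: one Python function call per row
    return big.apply(create_e, axis=1)


# Both approaches should give identical values
same_result = np.array_equal(with_where(), with_apply().to_numpy())

t_where = timeit.timeit(with_where, number=10) / 10
t_apply = timeit.timeit(with_apply, number=1)
print(f"np.where: {t_where * 1e3:.2f} ms, apply: {t_apply * 1e3:.2f} ms")
```

Checking same_result first makes sure the two versions are actually computing the same column before their speed is compared.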

Answer 2

I think the apply function would be a better solution. The code could look like this:

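def createE(row):
    # Row-wise version of the rule: ON -> C + D, otherwise max(B, C) - C + D
    if row.A == 'ON':
        return row.C + row.D
    else:
        return max(row.B, row.C) - row.C + row.D

# axis=1 passes each row (not each column) to createE
df1['E'] = df1.apply(createE, axis=1)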

See more about apply at https://www.geeksforgeeks.org/create-a-new-column-in-pandas-dataframe-based-on-the-existing-columns/
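One detail worth noting: DataFrame.apply passes each column to the function by default (axis=0), so a function that reads row.A, row.B and so on needs axis=1 to receive rows. A minimal sketch with the sample data from the question:

```python
import pandas as pd

# Sample data from the question
df1 = pd.DataFrame({
    "T_ID": [1, 2, 3, 4, 5, 6],
    "A": ["ON", "OFF", "OFF", "ON", "ON", "ON"],
    "B": [100, 150, 200, 400, 100, 250],
    "C": [90, 120, 150, 320, 60, 200],
    "D": [0, -20, 0, 0, -10, 0],
})


def createE(row):
    # Inside the function the values are plain scalars,
    # so the built-in max works here.
    if row.A == "ON":
        return row.C + row.D
    return max(row.B, row.C) - row.C + row.D


# axis=1 -> createE is called once per row
df1["E"] = df1.apply(createE, axis=1)
print(df1)
```

This is easy to read, but it calls createE once per row, so it scales worse than a vectorized expression on large frames.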
