Vectorizing a pandas df by row with multiple conditional statements

Question

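I'm trying to avoid a loop when applying a function to each row of a pandas df. I've looked at many vectorization examples but haven't come across anything that fully works. Ultimately, I'm trying to add an extra df column containing the sum of the conditions that succeed, where each condition is checked per row and carries its own value.

I've looked at np.apply_along_axis, but that is just a hidden loop, and at np.where, but I can't see how it would work with the 25 conditions I'm checking.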

              A         B         C  ...         R         S         T
0  0.279610  0.307119  0.553411  ...  0.897890  0.757151  0.735718
1  0.718537  0.974766  0.040607  ...  0.470836  0.103732  0.322093
2  0.222187  0.130348  0.894208  ...  0.480049  0.348090  0.844101
3  0.834743  0.473529  0.031600  ...  0.049258  0.594022  0.562006
4  0.087919  0.044066  0.936441  ...  0.259909  0.979909  0.403292

[5 rows x 20 columns]

def point_calc(row):
    points = 0
    if row[2] >= row[13]:
        points += 1
    if row[2] < 0:
        points -= 3
    if row[4] >= row[8]:
        points += 2
    if row[4] < row[12]:
        points += 1
    if row[16] == row[18]:
        points += 4
    return points

points_list = []
for indx, row in df.iterrows():
    value = point_calc(row)
    points_list.append(value)

df['points'] = points_list

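This is obviously inefficient, but I'm not sure how to vectorize my code, since it needs the per-row values from every column of the df to compute the custom sum of conditions.

Any help pointing me in the right direction would be much appreciated.

Thanks.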

Update: I can get a speed-up by replacing the df.iterrows part with df.apply:

df['points'] = df.apply(lambda row: point_calc(row), axis=1)

Update 2: I updated the function as follows and cut the run time substantially, a 10x speed-up over using df.apply with the initial function.

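def point_calc(arr):
    # arr is the whole 2-D array from df.to_numpy(); each np.where
    # evaluates one condition for all rows at once.
    a1 = np.where(arr[:, 2] >= arr[:, 13], 1, 0)
    a2 = np.where(arr[:, 2] < 0, -3, 0)
    a3 = np.where(arr[:, 4] >= arr[:, 8], 2, 0)
    a4 = np.where(arr[:, 4] < arr[:, 12], 1, 0)
    a5 = np.where(arr[:, 16] == arr[:, 18], 4, 0)
    # a4 and a5 mirror the first point_calc; further conditions
    # follow the same pattern.
    all_points = a1 + a2 + a3 + a4 + a5
    return all_points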

df['points'] = point_calc(df.to_numpy())

What I'm still working on is using np.vectorize on the function itself, to see whether that can improve it further.
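On the np.vectorize idea: the NumPy documentation describes np.vectorize as a convenience wrapper whose implementation is essentially a for loop, so it is unlikely to beat the column-wise np.where version. The following is only a minimal sketch to check that the two approaches agree; the scalar function `score_scalar`, the random stand-in data, and the choice of just two conditions are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
arr = rng.random((1000, 20))  # stand-in data with 20 columns, as in the question

# Hypothetical scalar scoring function using two of the question's conditions.
def score_scalar(c2, c13, c4, c8):
    points = 0
    if c2 >= c13:
        points += 1
    if c4 >= c8:
        points += 2
    return points

# np.vectorize calls score_scalar once per element in a Python-level loop.
looped = np.vectorize(score_scalar)(arr[:, 2], arr[:, 13], arr[:, 4], arr[:, 8])

# Column-wise version: each comparison runs over a whole column at once.
columnwise = (np.where(arr[:, 2] >= arr[:, 13], 1, 0)
              + np.where(arr[:, 4] >= arr[:, 8], 2, 0))

same = bool(np.array_equal(looped, columnwise))
```

Timing both on your own data (for example with timeit) is the reliable way to see the difference.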

Answer 1

You can try it the following way:

import numpy as np
import pandas as pd
# this is a small version of your dataframe
df = pd.DataFrame(np.random.random((10,4)), columns=list('ABCD'))

It looks like this:

    A           B           C           D
0   0.724198    0.444924    0.554168    0.368286
1   0.512431    0.633557    0.571369    0.812635
2   0.680520    0.666035    0.946170    0.652588
3   0.467660    0.277428    0.964336    0.751566
4   0.762783    0.685524    0.294148    0.515455
5   0.588832    0.276401    0.336392    0.997571
6   0.652105    0.072181    0.426501    0.755760
7   0.238815    0.620558    0.309208    0.427332
8   0.740555    0.566231    0.114300    0.353880
9   0.664978    0.711948    0.929396    0.014719

You can create a Series to hold your scores, initialized with zeros:

points = pd.Series(0, index=df.index)

It looks like this:

0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    0
8    0
9    0
dtype: int64

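After that you can add and subtract values as needed, one condition per line. The condition inside the brackets selects the rows where it is true, so -= and += are applied only to those rows.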

points.loc[df.A < df.C] += 1
points.loc[df.B <    0] -= 3
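Applied to the five conditions from the question, the same pattern might look like the sketch below. It assumes a 20-column frame filled with random stand-in data, and the helper `col` is hypothetical, added only so the question's positional indices (2, 13, 4, ...) can be reused unchanged.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Stand-in for the question's frame: 20 columns labelled A..T.
df = pd.DataFrame(rng.random((5, 20)), columns=list("ABCDEFGHIJKLMNOPQRST"))

def col(i):
    """Hypothetical helper: return the i-th column by position."""
    return df.iloc[:, i]

points = pd.Series(0, index=df.index)
# One line per condition; each mask selects the rows that earn the points.
points.loc[col(2) >= col(13)] += 1
points.loc[col(2) < 0] -= 3
points.loc[col(4) >= col(8)] += 2
points.loc[col(4) < col(12)] += 1
points.loc[col(16) == col(18)] += 4

df["points"] = points
```

Each line costs one pass over the column data, so 25 conditions means 25 vectorized passes rather than one Python-level call per row.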

Finally, if you need to, you can extract the values of the Series as a NumPy array (optional):

point_list = points.values
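With many conditions (the question mentions 25), one possible variation is to keep the conditions and their point values together in a list and sum them in a single step. This is a sketch under assumptions, not part of the approach above: the name `rules`, the random stand-in data, and the use of a matrix product for the weighted sum are all choices made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Stand-in for the question's frame: 20 columns labelled A..T.
df = pd.DataFrame(rng.random((8, 20)), columns=list("ABCDEFGHIJKLMNOPQRST"))
a = df.to_numpy()

# Each rule is (boolean mask over all rows, points awarded when True).
rules = [
    (a[:, 2] >= a[:, 13], 1),
    (a[:, 2] < 0, -3),
    (a[:, 4] >= a[:, 8], 2),
    (a[:, 4] < a[:, 12], 1),
    (a[:, 16] == a[:, 18], 4),
]

# Stack the masks into an (n_rows, n_rules) matrix of 0/1 values.
masks = np.column_stack([mask for mask, _ in rules]).astype(np.int64)
weights = np.array([weight for _, weight in rules])

# The matrix-vector product sums the weights of the satisfied rules per row.
df["points"] = masks @ weights
```

Adding a condition then means adding one entry to `rules`, and the scoring line stays the same.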

Does this solve your problem?
