How to use vectorization instead of a for loop in pandas

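I'm trying to build a machine learning algorithm for my work. The data I use for training and testing has 17k rows and 20 columns. I tried to add a new column based on two other columns, but the for loop I wrote is too slow (it takes 3 seconds to execute):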

for i in range(0, len(model_olculeri)):
    if (model_olculeri["Bel"][i] != 0) and (model_olculeri["Basen"][i] != 0):
        sum_column = (model_olculeri["Bel"][i]) / (model_olculeri["Basen"][i])
        model_olculeri["Waist to Hip Ratio"][i] = sum_column

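I've read that vectorization with pandas and numpy, rather than a for loop over a pandas DataFrame, is faster and more efficient. How can I implement this vectorization for my for loop? Thanks a lot.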

Create a boolean mask and use it for filtering:

m = (model_olculeri["Bel"] != 0) & (model_olculeri["Basen"] != 0)
model_olculeri.loc[m,"Waist to Hip Ratio"] = model_olculeri.loc[m, "Bel"] / model_olculeri.loc[m,"Basen"]

Alternatively, filter only the left-hand side; pandas aligns the right-hand side on the index:

model_olculeri.loc[m,"Waist to Hip Ratio"] = model_olculeri["Bel"] / model_olculeri["Basen"]
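To make the mask approach concrete, here is a minimal sketch on a tiny made-up DataFrame. The sample values are hypothetical; only the column names come from the question. Rows where either measurement is zero are simply left as NaN in the new column.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
model_olculeri = pd.DataFrame({
    "Bel": [70.0, 0.0, 80.0, 90.0],
    "Basen": [100.0, 95.0, 0.0, 120.0],
})

# True only where both measurements are non-zero.
m = (model_olculeri["Bel"] != 0) & (model_olculeri["Basen"] != 0)

# Assign the ratio for the masked rows; the remaining rows stay NaN.
model_olculeri.loc[m, "Waist to Hip Ratio"] = (
    model_olculeri.loc[m, "Bel"] / model_olculeri.loc[m, "Basen"]
)

ratio = model_olculeri["Waist to Hip Ratio"]
print(model_olculeri)
```

The whole column is computed in a couple of array operations instead of one Python-level iteration per row, which is where the speedup over the loop comes from.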

Or set the new values with numpy.where (this needs import numpy as np):

model_olculeri["Waist to Hip Ratio"] = np.where(m, model_olculeri["Bel"] / model_olculeri["Basen"], np.nan)
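A small sketch of the numpy.where variant on hypothetical sample data (only the column names are from the question). One thing worth noticing is that the division is evaluated for every row before the mask is applied, so a zero denominator briefly produces inf, which numpy.where then replaces with NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
model_olculeri = pd.DataFrame({
    "Bel": [70.0, 0.0, 80.0, 90.0],
    "Basen": [100.0, 95.0, 0.0, 120.0],
})

m = (model_olculeri["Bel"] != 0) & (model_olculeri["Basen"] != 0)

# The division runs on every row first (80 / 0 gives inf here);
# np.where then keeps the result only where the mask is True.
full_ratio = model_olculeri["Bel"] / model_olculeri["Basen"]
model_olculeri["Waist to Hip Ratio"] = np.where(m, full_ratio, np.nan)

print(model_olculeri)
```

Unlike the .loc assignment, this always overwrites the whole column, so any values previously stored in "Waist to Hip Ratio" for the unmasked rows would be replaced by NaN.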

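A chained solution using query and pipe (this returns a new DataFrame containing only the filtered rows):

model_olculeri.query("Bel != 0 & Basen != 0").pipe(lambda x: x.assign(**{"Waist to Hip Ratio": x["Bel"] / x["Basen"]}))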
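The chained version behaves differently from the mask-based ones, which the following sketch on hypothetical sample data tries to show: query drops the rows with a zero measurement, and the result is a new DataFrame rather than an in-place update. Because the column name contains spaces, it is passed to assign through dictionary unpacking instead of as a keyword.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
model_olculeri = pd.DataFrame({
    "Bel": [70.0, 0.0, 80.0, 90.0],
    "Basen": [100.0, 95.0, 0.0, 120.0],
})

result = (
    model_olculeri
    .query("Bel != 0 and Basen != 0")  # keep rows where both are non-zero
    .pipe(lambda x: x.assign(**{"Waist to Hip Ratio": x["Bel"] / x["Basen"]}))
)

print(result)                   # only the rows that passed the filter
print(model_olculeri.columns)   # the original frame has no new column
```

If the filtered rows need to stay in the data with NaN in the new column, the mask or numpy.where approaches are the closer match to the original loop.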