How to use vectorization instead of a for loop in pandas
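I am trying to build a machine learning algorithm for my work. The data I use for training and testing has 17k rows and 20 columns. I tried to add a new column based on two other columns, but the for loop I wrote is too slow (it takes 3 seconds to execute):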
for i in range(0, len(model_olculeri)):
    if (model_olculeri["Bel"][i] != 0) and (model_olculeri["Basen"][i] != 0):
        sum_column = (model_olculeri["Bel"][i]) / (model_olculeri["Basen"][i])
        model_olculeri["Waist to Hip Ratio"][i] = sum_column
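I have read about pandas and NumPy vectorization as an alternative to for loops over a pandas DataFrame, and it seems to be faster and more efficient. How can I vectorize my for loop? Thanks a lot.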
Create a boolean mask and use it for filtering:
m = (model_olculeri["Bel"] != 0) & (model_olculeri["Basen"] != 0)
model_olculeri.loc[m,"Waist to Hip Ratio"] = model_olculeri.loc[m, "Bel"] / model_olculeri.loc[m,"Basen"]
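Alternatively, because the assignment aligns on the index, the right-hand side does not need to be filtered: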
model_olculeri.loc[m,"Waist to Hip Ratio"] = model_olculeri["Bel"] / model_olculeri["Basen"]
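To see the mask approach end to end, here is a minimal sketch on a tiny made-up frame. Only the column names come from the question; the four rows of data and the short name `df` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({"Bel": [70, 0, 90, 80], "Basen": [100, 95, 0, 100]})

# True only where both measurements are non-zero.
m = (df["Bel"] != 0) & (df["Basen"] != 0)

# Assign the ratio to the masked rows only. The column does not exist yet,
# so the rows outside the mask are left as NaN.
df.loc[m, "Waist to Hip Ratio"] = df["Bel"] / df["Basen"]

print(df)
```

The division runs once over whole columns instead of once per row, which is where the speed-up over the loop comes from.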
Or set the new values with numpy.where:
import numpy as np
model_olculeri["Waist to Hip Ratio"] = np.where(m, model_olculeri["Bel"] / model_olculeri["Basen"], np.nan)
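One difference from the `.loc` version is that `numpy.where` rewrites the whole column, so any values already stored in rows outside the mask are replaced by the fallback. A small sketch of that behaviour follows; the sample data and the pre-filled `-1.0` values are assumptions made up for the demonstration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with a pre-existing ratio column.
df = pd.DataFrame({
    "Bel": [70, 0, 90, 80],
    "Basen": [100, 95, 0, 100],
    "Waist to Hip Ratio": [-1.0, -1.0, -1.0, -1.0],
})

m = (df["Bel"] != 0) & (df["Basen"] != 0)

# The division is evaluated for every row; np.where then keeps the result
# where the mask is True and uses NaN everywhere else.
df["Waist to Hip Ratio"] = np.where(m, df["Bel"] / df["Basen"], np.nan)

print(df)
```

If the old values in the unmasked rows should survive, the `.loc[m, ...]` assignment is the safer choice.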
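A chained solution using query and pipe (the column name contains spaces, so it has to be passed to assign through a dict rather than as a keyword):
model_olculeri.query("Bel != 0 & Basen != 0").pipe(lambda x: x.assign(**{"Waist to Hip Ratio": x.Bel / x.Basen}))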
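Note that the chained form behaves differently from the mask-based ones: it returns a new frame containing only the rows that pass the filter and leaves the original untouched. A minimal sketch, again on made-up data (the rows and the names `df` and `result` are assumptions):

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({"Bel": [70, 0, 90, 80], "Basen": [100, 95, 0, 100]})

# query drops rows where either value is zero; assign then adds the ratio.
# The dict unpacking is needed because the column name contains spaces.
result = (
    df.query("Bel != 0 & Basen != 0")
      .pipe(lambda x: x.assign(**{"Waist to Hip Ratio": x["Bel"] / x["Basen"]}))
)

print(result)
```

Here `result` keeps the original index labels of the surviving rows, and `df` itself gains no new column.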