Better way to execute an iterative function on a very large dataset in Python

Question

for i in range(1,len(df_raw)):
    if df_raw.loc[i-1, 'A']!= 0 & df_raw.loc[i, 'A']== 0 & df_raw.loc[i+1, 'A']== 0:
        df_raw.loc[i,'B'] = df_raw.loc[i+5,'B']

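Hi everyone, I am trying to run the lines of code above on my data. The code runs for data of up to 100,000-150,000 rows, but for larger data it just keeps running without producing any output. Can you help me write this code in a better way so that it handles larger volumes of data?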

Answer 1

Your code probably just takes a long time to finish because it has to perform a large number of steps (more than 150,000). I suggest a few things:

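Check whether you really need to run the code for every element of the array. If not, skipping the unnecessary elements will improve performance significantly.
Check top / Task Manager / System Monitor (depending on your operating system) to see whether you are running out of memory.
Change your bitwise and (&) to the more idiomatic and faster (short-circuiting) and. In Python, & also binds more tightly than != and ==, so the original condition is not grouped the way it reads.
Profile your code.
Add a progress bar:
On the command line: pip install tqdm
In your code: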

from tqdm import tqdm

for i in tqdm(range(1,len(df_raw))):
    if df_raw.loc[i-1, 'A'] != 0 and df_raw.loc[i, 'A'] == 0 and df_raw.loc[i+1, 'A']== 0:
        df_raw.loc[i,'B'] = df_raw.loc[i+5,'B']

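Consider multiprocessing. If you can split your code into discrete segments, you can parallelize it on a multi-core system. This can be hard to get right, so I would start with the steps above. If you decide to go down this route and need help, please edit your question with a more complete code sample.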
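To make the "profile your code" suggestion concrete, here is a minimal sketch using the standard-library cProfile and pstats modules. The sample DataFrame, its size, and the run_loop helper are assumptions made up for illustration; the loop also stops five rows early so that i + 5 stays inside the index, which the original loop does not do.

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd


def run_loop(df):
    """Row-by-row loop from the question, stopped early so i + 5 stays in range."""
    for i in range(1, len(df) - 5):
        if df.loc[i - 1, 'A'] != 0 and df.loc[i, 'A'] == 0 and df.loc[i + 1, 'A'] == 0:
            df.loc[i, 'B'] = df.loc[i + 5, 'B']
    return df


# Hypothetical sample data: 2,000 rows with a default RangeIndex
rng = np.random.default_rng(0)
df_sample = pd.DataFrame({
    'A': rng.integers(0, 3, size=2000),
    'B': rng.random(2000),
})

# Profile only the loop
profiler = cProfile.Profile()
profiler.enable()
run_loop(df_sample)
profiler.disable()

# Collect the ten most expensive calls, sorted by cumulative time
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats('cumulative').print_stats(10)
report = stream.getvalue()
print(report)
```

On a loop like this, the report will most likely be dominated by pandas indexing internals behind .loc, which suggests the per-row lookups rather than the comparison itself are the cost to eliminate.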

Answer 2

I think what you are missing for doing this kind of logic efficiently is shift. Here is my suggestion:

df_raw = df_raw.sort_index()  # Optional, if index is not sorted
df_raw['A_is_zero'] = df_raw['A'] == 0
# fill_value keeps the shifted columns boolean, so ~ and & act as logical operators
df_raw['prev_A_is_zero'] = df_raw['A_is_zero'].shift(1, fill_value=True)
df_raw['next_A_is_zero'] = df_raw['A_is_zero'].shift(-1, fill_value=False)
B_to_change = df_raw['A_is_zero'] & df_raw['next_A_is_zero'] & ~df_raw['prev_A_is_zero']
# Copy B from five rows further down wherever the condition holds
df_raw.loc[B_to_change, 'B'] = df_raw['B'].shift(-5).loc[B_to_change]

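Since you did not provide a sample DataFrame, I have not tested this, so I cannot guarantee that it works, but I think it conveys the main idea for implementing a solution. For example, in the four rows before the last one, you will get NaN in 'B' wherever B_to_change is True. Another point is that you use .loc with integers, and I do not know whether your index is a range. If it is, my first line is useless. If it is not and you actually meant to use iloc (see this link on the difference between loc and iloc), my first line should be removed, because it would not lead to the expected result.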
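As a quick check of the shift idea, the following sketch compares it with an explicit loop on a small, made-up DataFrame. The sample values and the names expected, result, and b_to_change are assumptions for illustration, and the index is assumed to be a default RangeIndex. The reference loop stops five rows early so that i + 5 stays in range.

```python
import pandas as pd

# Hypothetical sample data with a default RangeIndex (0, 1, 2, ...)
df_raw = pd.DataFrame({
    'A': [1, 0, 0, 2, 0, 0, 0, 3, 1, 0, 0, 4],
    'B': [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0],
})

# Reference: explicit loop, stopped early so that i + 5 stays in range
expected = df_raw.copy()
for i in range(1, len(expected) - 5):
    if expected.loc[i - 1, 'A'] != 0 and expected.loc[i, 'A'] == 0 and expected.loc[i + 1, 'A'] == 0:
        expected.loc[i, 'B'] = expected.loc[i + 5, 'B']

# Vectorized version: compare each row with its neighbours via shift
a_is_zero = df_raw['A'] == 0
prev_a_is_zero = a_is_zero.shift(1, fill_value=True)    # first row has no predecessor
next_a_is_zero = a_is_zero.shift(-1, fill_value=False)  # last row has no successor
b_to_change = a_is_zero & next_a_is_zero & ~prev_a_is_zero

result = df_raw.copy()
result.loc[b_to_change, 'B'] = result['B'].shift(-5)[b_to_change]

# Rows 1 and 4 match the loop; row 9 also satisfies the condition but has
# no row 14 to copy from, so the vectorized version leaves NaN there.
print(result)
```

Row 9 illustrates the edge case mentioned above: near the end of the frame the shifted 'B' is NaN, so you may want to restrict b_to_change to rows that still have a row five positions further down.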

Edit:

My requirements have some iterative, conditional, sequential operations, e.g.:
for i in range(1, len(df_raw)):
    if df_raw.loc[i, 'B'] != 0:
        df_raw.loc[i, 'A'] = df_raw.loc[i-1, 'A']

In that case (which you should have specified in the question), you can use a forward fill, like this:

B_is_zero = df_raw['B'] == 0
# Keep A only where B is zero; all other rows become NaN
df_raw['new_A'] = df_raw['A'].where(B_is_zero)
# Forward-fill the gaps with the last kept value
df_raw['A'] = df_raw['new_A'].ffill()

Once again, you should pay attention to how you handle the edge case where 'B' is nonzero in the first row.
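The first-row edge case can be seen on a small example. This is only a sketch on made-up data; the names keep, naive, result, and expected are assumptions for illustration. Since the loop starts at i = 1, it never modifies row 0, so one possible way to match it is to always keep the first row's 'A'.

```python
import pandas as pd

# Hypothetical sample data where 'B' is nonzero in the first row
df_raw = pd.DataFrame({
    'A': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    'B': [1, 0, 1, 1, 0, 1],
})

# Reference: the sequential loop, which never visits row 0
expected = df_raw.copy()
for i in range(1, len(expected)):
    if expected.loc[i, 'B'] != 0:
        expected.loc[i, 'A'] = expected.loc[i - 1, 'A']

# Naive forward fill: row 0 is masked and has nothing before it to fill from
naive = df_raw['A'].where(df_raw['B'] == 0).ffill()

# Adjusted forward fill: always keep the first row, as the loop does
keep = df_raw['B'] == 0
keep.iloc[0] = True
result = df_raw.copy()
result['A'] = df_raw['A'].where(keep).ffill()

print(naive.tolist())
print(result['A'].tolist())
```

The naive version leaves NaN in the first row, while the adjusted version reproduces the loop on this sample.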


python

iteration

for-loop

pandas