How to use multiprocessing for a function instead of a loop?

Question

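I have written a function of about 400 lines that performs some data science on a DataFrame. A single call takes about 10 seconds. I need to run this function 100 times with different parameters, so I call it 100 times in a loop, passing 4 different parameters on each iteration. In total this takes about 15 minutes. I would therefore like to use CPU parallelization. How can I use multiprocessing in Python to parallelize this and shorten the run time?

Code example: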

result = []
for i in range(100):
    result.append(searching_algorithm(a[i], b[i], c[i], d[i]))

Answer 1

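You did not say what kind of lists a, b, c and d are. The elements of these lists must be serializable with the pickle module, because they have to be passed to a function that is executed by a process running in a different address space. For the sake of argument, let's assume they are lists of integers with at least 100 elements each.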
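Whether a given argument meets the pickle requirement can be checked before any worker processes are involved. The following is only a minimal sketch: the helper name is_picklable is hypothetical and not part of any library, and it only tests that pickle.dumps succeeds in the current process.

```python
import pickle


def is_picklable(obj):
    """Return True if pickle.dumps(obj) succeeds (hypothetical helper)."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        # These are the usual failure types for objects pickle cannot handle.
        return False
    return True


if __name__ == '__main__':
    print(is_picklable([1, 2, 3]))        # a list of plain integers
    print(is_picklable({'alpha': 0.5}))   # a dict of simple values
    print(is_picklable(lambda x: x + 1))  # a lambda, which pickle rejects
```

Running a check like this on one element from each of a, b, c and d is a cheap way to catch a serialization problem before the pool is created.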

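You also did not state which platform you are running on (Windows? MacOS? Linux?). When you tag a question with multiprocessing, you are supposed to tag it with the platform as well, because how you should organize your code depends to some extent on the platform. In the code below I have chosen the most efficient arrangement for platforms that use spawn to create new processes, such as Windows. The same arrangement also works well on platforms that use fork, which has traditionally been the default on Linux (MacOS has defaulted to spawn since Python 3.8). You can research what spawn and fork imply for creating new processes. Ultimately, to be memory and CPU efficient, the only global variables you want outside the if __name__ == '__main__': block are those that must be global. That is why I declare the lists local to a function.

Then, using the concurrent.futures module, we have: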
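To see which start method applies on a given machine, the multiprocessing module can be queried directly. This is a small sketch; the function name describe_start_methods is made up for illustration, while get_start_method and get_all_start_methods are standard multiprocessing calls.

```python
import multiprocessing as mp


def describe_start_methods():
    """Return (current start method, list of start methods this platform supports)."""
    return mp.get_start_method(), mp.get_all_start_methods()


if __name__ == '__main__':
    current, available = describe_start_methods()
    print('current start method:', current)
    print('available start methods:', available)
```

If the reported method is spawn, every worker re-imports the main module, which is why heavy module-level work and large module-level data are best avoided.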

from concurrent.futures import ProcessPoolExecutor

def searching_algorithm(a, b, c, d):
    ...
    return a * b * c * d

def main():
    # We assume a, b, c and d each have 100 or more elements:
    a = list(range(1, 101))
    b = list(range(2, 102))
    c = list(range(3, 103))
    d = list(range(4, 104))
    # Use all CPU cores:
    with ProcessPoolExecutor() as executor:
        result = list(executor.map(searching_algorithm, a[0:100], b[0:100], c[0:100], d[0:100]))
    print(result[0], result[-1])

# Required for Windows:
if __name__ == '__main__':
    main()

Prints:

24 106110600
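For the workload described in the question (100 calls at roughly 10 seconds each), a back-of-the-envelope estimate shows what a pool can be expected to gain. The sketch below is an idealized model only: it assumes each worker handles one task at a time, that all tasks take the same time, and that process start-up and pickling overhead are negligible. The function name estimated_wall_time is hypothetical.

```python
import math
import os


def estimated_wall_time(n_tasks, seconds_per_task, n_workers):
    """Idealized wall time: tasks complete in ceil(n_tasks / n_workers) rounds."""
    rounds = math.ceil(n_tasks / n_workers)
    return rounds * seconds_per_task


if __name__ == '__main__':
    # Assumption: one worker per CPU reported by the operating system.
    workers = os.cpu_count() or 1
    print(workers, 'workers:', estimated_wall_time(100, 10, workers), 'seconds')
```

Under these assumptions, 8 workers would need 13 rounds, or about 130 seconds, compared with about 1000 seconds for the sequential loop. Real timings will be somewhat worse because of the overheads ignored here.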

To use the multiprocessing module instead:

from multiprocessing import Pool

def searching_algorithm(a, b, c, d):
    ...
    return a * b * c * d

def main():
    # We assume a, b, c and d each have 100 or more elements:
    a = list(range(1, 101))
    b = list(range(2, 102))
    c = list(range(3, 103))
    d = list(range(4, 104))
    # Use all CPU cores:
    with Pool() as pool:
        result = pool.starmap(searching_algorithm, zip(a[0:100], b[0:100], c[0:100], d[0:100]))
    print(result[0], result[-1])

# Required for Windows:
if __name__ == '__main__':
    main()

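In both code examples, if the lists a, b, c and d contain exactly 100 elements, there is no need to slice them as in a[0:100]; just pass the lists themselves, for example: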

        result = list(executor.map(searching_algorithm, a, b, c, d))
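When the lists are longer than 100 elements, slicing each one creates four temporary copies. One possible alternative, sketched here under the assumption that only the first n parameter sets are wanted, is to cut the zipped argument tuples lazily with itertools.islice. The helper name first_n_tasks is hypothetical.

```python
from itertools import islice


def first_n_tasks(n, *param_lists):
    """Lazily yield the first n argument tuples, one tuple per function call."""
    return islice(zip(*param_lists), n)


if __name__ == '__main__':
    # Lists deliberately longer than 100 elements:
    a = list(range(1, 201))
    b = list(range(2, 202))
    c = list(range(3, 203))
    d = list(range(4, 204))
    tasks = list(first_n_tasks(100, a, b, c, d))
    print(len(tasks), tasks[0], tasks[-1])
```

The iterator returned by first_n_tasks could be passed to pool.starmap in place of the zip over sliced lists used in the multiprocessing example.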

Tags: python, windows, parallel-processing, multiprocessing