Python 使用 Pool 的多处理递归失控

Question

我正在尝试使我的 pandas 计算的昂贵部分并行以加快速度。

我已经设法让 Multiprocessing.Pool 使用一个简单的例子：

import multiprocessing as mpr
import numpy as np

def Test(l):
  for i in range(len(l)):
    l[i] = i**2
  return l

t = list(np.arange(100))
L = [t,t,t,t]
if __name__ == "__main__":
  pool = mpr.Pool(processes=4)
  E = pool.map(Test,L)
  pool.close()
  pool.join()

这里没问题。现在我自己的算法有点复杂，我不能在这里 post 它的全部荣耀和可怕，所以我将使用一些伪代码来概述我在那里做的事情：

import pandas as pd
import time
import datetime as dt
import multiprocessing as mpr
import MPFunctions as mpf --> self-written worker functions that get called for the multiprocessing
import ClassGetDataFrames as gd  --> self-written class that reads in all the data and puts it into dataframes

=== Settings

=== Use ClassGetDataFrames to get data

=== Lots of single-thread calculations and manipulations on the dataframe

=== Cut dataframe into 4 evenly big chunks, make list of them called DDC

if __name__ == "__main__":
  pool = mpr.Pool(processes=4)
  LLT = pool.map(mpf.processChunks,DDC)
  pool.close()
  pool.join()

=== Join processed Chunks LLT back into one dataframe

=== More calculations and manipulations

=== Data Output

当我运行这个脚本时，会发生以下情况：

读入数据
它执行所有计算和操作，直到 Pool 语句。
突然又读入了四倍的数据。
然后同时进入主脚本四重
整个事情递归级联并变得混乱。

我以前看过，如果你不小心就会发生这种情况，但我不知道为什么会发生在这里。我的多处理代码受所需的 name-main-statement 保护（我在 Win7 64 上），它只有 4 行长，它有 close 和 join 语句，它调用一个定义的辅助函数，然后调用第二个辅助函数一个循环，就是这样。据我所知，它应该只创建包含四个进程的池，从导入的脚本中调用四个进程，关闭池并等待一切完成，然后继续执行脚本。在旁注中，我首先在同一个脚本中使用了 worker 函数，行为是相同的。而不是仅仅执行池中的操作，它似乎四次重新启动整个脚本。

任何人都可以告诉我什么可能导致这种行为？我似乎缺少对 Python 的多处理行为的一些重要理解。

另外我不知道这是否重要，我在我公司大型机上的虚拟机上。

我必须使用单独的进程而不是池吗？

Answer 1

我设法通过将整个脚本放入 if __name__ == "__main__": 语句中来使其工作，而不仅仅是多处理部分。

Python 使用 Pool 的多处理递归失控

Python Multiprocessing using Pool goes recursively haywire

python

pandas

python-multiprocessing