When should we call multiprocessing.Pool.join?

I am using multiprocessing.Pool.imap_unordered as follows:

from multiprocessing import Pool
pool = Pool()
for mapped_result in pool.imap_unordered(mapping_func, args_iter):
    ...  # do some additional processing on mapped_result

Do I need to call pool.close() or pool.join() after the for loop?

No, you don't, but it's probably a good idea if you aren't going to use the pool anymore.

Tim Peters explains the reasons for calling pool.close() and pool.join() well in this SO post:

As to Pool.close(), you should call that when - and only when - you're never going to submit more work to the Pool instance. So Pool.close() is typically called when the parallelizable part of your main program is finished. Then the worker processes will terminate when all work already assigned has completed.

It's also excellent practice to call Pool.join() to wait for the worker processes to terminate. Among other reasons, there's often no good way to report exceptions in parallelized code (exceptions occur in a context only vaguely related to what your main program is doing), and Pool.join() provides a synchronization point that can report some exceptions that occurred in worker processes that you'd otherwise never see.
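The close-then-join sequence described in the quote can be sketched roughly as follows. This is only an illustration: process_all and its processes parameter are hypothetical names that do not come from the question, and the built-in abs stands in for mapping_func simply because built-ins can always be pickled and sent to the workers. A function of your own would need to be defined at module top level for the same reason.

```python
from multiprocessing import Pool


def process_all(mapping_func, args_iter, processes=2):
    """Map args_iter through a pool, then shut the pool down cleanly."""
    pool = Pool(processes)
    results = []
    try:
        # Results arrive in completion order, not input order.
        for mapped_result in pool.imap_unordered(mapping_func, args_iter):
            results.append(mapped_result)
    finally:
        pool.close()  # no more work will be submitted to this pool
        pool.join()   # wait for the workers to exit; requires close() first
    return results


if __name__ == "__main__":
    # abs is a placeholder for mapping_func (an assumption for this sketch).
    print(sorted(process_all(abs, [-3, 1, -2])))  # prints [1, 2, 3]
```

The try/finally block is a design choice rather than a requirement: it makes sure the pool is closed and joined even if the processing inside the loop raises. The `if __name__ == "__main__":` guard matters on platforms such as Windows, where worker processes re-import the main module.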

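I had the same memory issue as in "Memory usage keep growing with Python's multiprocessing.pool" when I didn't call pool.close() and pool.join() while using pool.map() with a function that calculated Levenshtein distance. The function worked fine, but it wasn't garbage collected properly on a Win7 64-bit machine, and memory usage kept growing out of control every time the function was called, until it took the whole operating system down. Here's the code that fixed the leak: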

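from multiprocessing import Pool

# Pair the search string with every candidate; pool.map passes one tuple per call
stringList = []
for possible_string in stringArray:
    stringList.append((searchString, possible_string))

pool = Pool(5)
results = pool.map(myLevenshteinFunction, stringList)
pool.close()  # no more work will be submitted
pool.join()   # wait for the workers to exit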

After closing and joining the pool, the memory leak went away.