Python - pooling doesn't use all cores

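I'm using the Pool from the multiprocessing package (from multiprocessing.dummy import Pool). I wrote a function that reads a text file and preprocesses it for later use. I have about 20,000 of these text files, so I want to parallelize the work, and I used a pool for that. The remote server where I run the code has 32 cores, so I tried opening 70 processes (I also tried fewer, and the problem persists). This is what my system monitor looks like:

As you can see, 16 of the 32 cores are not working at all!

Any help would be appreciated.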

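As I said in the comment, all multiprocessing.dummy structures are designed to emulate the multiprocessing interface using regular threads, which is useful for testing, debugging, profiling and so on. Or, as the official documentation puts it: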

multiprocessing.dummy replicates the API of multiprocessing but is no more than a wrapper around the threading module.
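A quick way to see this for yourself is to check what the dummy pool actually is and where its tasks run. This is only a minimal sketch; worker_pid is a helper name made up for the illustration.

```python
import os
from multiprocessing.dummy import Pool as DummyPool
from multiprocessing.pool import ThreadPool


def worker_pid(_):
    # Report the ID of the process this task actually runs in.
    return os.getpid()


with DummyPool(4) as pool:
    # The "dummy" Pool factory hands back a thread-based pool.
    is_thread_pool = isinstance(pool, ThreadPool)
    worker_pids = set(pool.map(worker_pid, range(8)))

print(is_thread_pool)
print(worker_pids == {os.getpid()})  # True when every task ran in this process
```

If both checks hold, no extra processes were started at all, no matter what number was passed to the pool.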

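Python (CPython) threading does use real system threads, so in theory your threaded code could execute on different CPUs, but because of the dreaded GIL no two of those threads will ever run Python bytecode at the same time. There are exceptions to the rule: abstracted system calls and tasks that wait on events (like I/O) can execute in parallel, but the moment processing moves back into the Python domain the thread gets locked out by the GIL and is not allowed to continue until the interpreter switches context to it.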

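Long story short, if you want to use multiple cores through a multiprocessing pool, don't use the adaptations and abstractions from multiprocessing.dummy (this goes for other dummy packages as well) and use the root multiprocessing module itself - in your case, multiprocessing.pool.Pool.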
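As a rough sketch of that switch, assuming the preprocessing can be expressed as a function taking one file path: preprocess, preprocess_all and the throwaway demo files below are all hypothetical stand-ins, since the question doesn't show the real code.

```python
import os
import tempfile
from multiprocessing import Pool  # real processes, not threads


def preprocess(path):
    # Hypothetical stand-in for the real preprocessing step:
    # read one text file and return its lowercased tokens.
    with open(path, encoding="utf-8") as handle:
        return handle.read().lower().split()


def preprocess_all(paths, processes=None):
    # processes=None lets Pool default to os.cpu_count() workers.
    with Pool(processes=processes) as pool:
        # chunksize batches tasks to reduce inter-process overhead.
        return pool.map(preprocess, paths, chunksize=16)


if __name__ == "__main__":
    # Demo data: a few throwaway files instead of the real 20,000.
    with tempfile.TemporaryDirectory() as folder:
        paths = []
        for index in range(8):
            path = os.path.join(folder, f"doc_{index}.txt")
            with open(path, "w", encoding="utf-8") as handle:
                handle.write(f"Sample Text number {index}")
            paths.append(path)
        results = preprocess_all(paths)
        print(len(results), "files preprocessed")
```

The worker function has to live at module level so it can be pickled and sent to the worker processes, and the `if __name__ == "__main__":` guard matters on platforms that start workers by re-importing the main module. For CPU-bound work, asking for many more processes than there are cores (70 on a 32-core machine) generally buys nothing.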

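That being said, given that the threading module doesn't have a pool interface, I often find myself using multiprocessing.dummy.Pool (or multiprocessing.pool.ThreadPool) instead for I/O-heavy stuff (i.e. work not bound by the GIL), when shared memory matters more than spreading the processing across cores and the overhead that comes with it. Chances are that even after switching to multiprocessing.pool.Pool you won't notice much of a difference if you're not doing heavy post-processing while grabbing your files.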
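For the I/O-heavy case, a thread pool version might look like the sketch below. read_file and load_all are names invented for this example, and the thread count of 8 is an arbitrary assumption.

```python
import os
import tempfile
from multiprocessing.pool import ThreadPool


def read_file(path):
    # Blocking file reads release the GIL while waiting on the OS,
    # so several threads can overlap their waits.
    with open(path, "rb") as handle:
        return handle.read()


def load_all(paths, threads=8):
    # Threads share the parent's memory, so results are not pickled
    # and copied between processes as they would be with a process pool.
    with ThreadPool(threads) as pool:
        return dict(zip(paths, pool.map(read_file, paths)))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        paths = []
        for index in range(4):
            path = os.path.join(folder, f"doc_{index}.txt")
            with open(path, "wb") as handle:
                handle.write(b"raw bytes")
            paths.append(path)
        contents = load_all(paths)
        print(len(contents), "files loaded")
```

Whether this beats a process pool depends on how much of the time goes into waiting on the disk versus running Python code on the contents, so it is worth timing both on a sample of the real files.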
