Why does this parallel search and replace not use 100% of the CPU?
I have a long list of tweets (2 million of them), and I use a regex to search and replace text inside them.
I run this with a joblib.Parallel map (joblib is the parallel backend used by scikit-learn).
My problem is that I can see in Windows' Task Manager that my script does not use 100% of each CPU. It does not use 100% of the RAM or of the disk either, so I don't understand why it does not go faster.
There is probably some synchronization delay somewhere, but I can't figure out where.
The code:
# file main.py
import re
from joblib import delayed, Parallel

def make_tweets():
    tweets = load_from_file()  # this is a list of strings
    regex = re.compile(r'a *a|b *b')  # of course more complex IRL, with lookbehind/forward
    mydict = {'aa': 'A', 'bb': 'B'}

    def handler(match):
        return mydict[match[0].replace(' ', '')]

    def replace_in(tweet):
        return re.sub(regex, handler, tweet)

    # -1 means all cores
    # I have 6 cores that can run 12 threads
    with Parallel(n_jobs=-1) as parallel:
        tweets2 = parallel(delayed(replace_in)(tweet) for tweet in tweets)

    return tweets2
Here is the Task Manager:
Edit: final word

The answer is that joblib synchronization was slowing the workers down: joblib sends the tweets to the workers in small chunks (one at a time?), which makes them wait. Using multiprocessing.Pool.map with a chunksize of len(tweets)/cpu_count() got the workers to use 100% of the CPU.

With joblib, the running time was about 12 minutes. With multiprocessing it was 4 minutes. With multiprocessing, each worker consumed about 50 MB of memory.
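A minimal sketch of that fix, assuming regex, handler and replace_in are moved to module level so they can be pickled and sent to the worker processes (load_from_file is the same placeholder as in the original code):

# sketch of the multiprocessing-based fix described above
import re
from multiprocessing import Pool, cpu_count

regex = re.compile(r'a *a|b *b')  # of course more complex IRL, with lookbehind/forward
mydict = {'aa': 'A', 'bb': 'B'}

def handler(match):
    return mydict[match[0].replace(' ', '')]

def replace_in(tweet):
    return re.sub(regex, handler, tweet)

def make_tweets():
    tweets = load_from_file()  # list of strings, placeholder from the original code
    chunk = max(1, len(tweets) // cpu_count())  # one large chunk per core
    with Pool() as pool:
        return pool.map(replace_in, tweets, chunksize=chunk)

if __name__ == '__main__':  # required on Windows, where workers are spawned
    tweets2 = make_tweets()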
After playing with this for a bit, I think it's because joblib is spending all of its time coordinating the parallel running of everything, and no time actually doing any useful work. At least that's what I see under OSX and Linux; I don't have any MS Windows machines.
I started by loading the packages, pulling your code out, and generating a dummy file:
from random import choice
import re

from multiprocessing import Pool
from joblib import delayed, Parallel

regex = re.compile(r'a *a|b *b')  # of course more complex IRL, with lookbehind/forward
mydict = {'aa': 'A', 'bb': 'B'}

def handler(match):
    return mydict[match[0].replace(' ', '')]

def replace_in(tweet):
    return re.sub(regex, handler, tweet)

examples = [
    "Regex replace isn't that computationally expensive... I would suggest using Pandas, though, rather than just a plain loop",
    "Hmm I don't use pandas anywhere else, but if it makes it faster, I'll try! Thanks for the suggestion. Regarding the question: expensive or not, if there is no reason for it to use only 19%, it should use 100%",
    "Well, is tweets a generator, or an actual list?",
    "an actual list of strings",
    "That might be causing the main process to have the 419MB of memory, however, that doesn't mean that list will be copied over to the other processes, which only need to work over slices of the list",
    "I think joblib splits the list in roughly equal chunks and sends these chunks to the worker processes.",
    "Maybe, but if you use something like this code, 2 million lines should be done in less than a minute (assuming an SSD, and reasonable memory speeds).",
    "My point is that you don't need the whole file in memory. You could type tweets.txt | python replacer.py > tweets_replaced.txt, and use the OS's native speeds to replace data line-by-line",
    "I will try this",
    "no, this is actually slower. My code takes 12mn using joblib.parallel and for line in f_in: f_out.write(re.sub(..., line)) takes 21mn. Concerning CPU and memory usage: CPU is same (17%) and memory much lower (60Mb) using files. But I want to minimize time spent, not memory usage.",
    "I moved this to chat because Stack Overflow suggested it",
    "I don't have experience with joblib. Could you try the same with Pandas? pandas.pydata.org/pandas-docs/…",
]

with open('tweets.txt', 'w') as fd:
    for i in range(2_000_000):
        print(choice(examples), file=fd)
(See if you can guess where I got those lines from!)
As a baseline, I tried the naive solution:
with open('tweets.txt') as fin, open('tweets2.txt', 'w') as fout:
    for l in fin:
        fout.write(replace_in(l))
This takes 14.0 seconds (wall-clock time) on my OSX laptop and 5.15 seconds on my Linux desktop. Note that changing the definition of replace_in to use regex.sub(handler, tweet) instead of re.sub(regex, handler, tweet) reduces the above to 8.6 seconds on my laptop, but I will not use that change below.
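For clarity, that faster variant just calls sub directly on the precompiled pattern:

def replace_in(tweet):
    # use the compiled pattern's own sub method rather than going through re.sub
    return regex.sub(handler, tweet)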
I then tried your joblib version:
with open('tweets.txt') as fin, open('tweets2.txt', 'w') as fout:
    with Parallel(n_jobs=-1) as parallel:
        for l in parallel(delayed(replace_in)(tweet) for tweet in fin):
            fout.write(l)
This takes 1 minute 16 seconds on my laptop and 34.2 seconds on my desktop. CPU utilization was pretty low, as the child/worker tasks spent most of their time waiting for the coordinator to send them work.
I then tried the multiprocessing package:
with open('tweets.txt') as fin, open('tweets2.txt', 'w') as fout:
    with Pool() as pool:
        for l in pool.map(replace_in, fin, chunksize=1024):
            fout.write(l)
This took 5.95 seconds on my laptop and 2.60 seconds on my desktop. I also tried a chunk size of 8, which took 22.1 and 8.29 seconds respectively. The chunk size allows the pool to send big pieces of work down to its children, so it spends less time coordinating and more time getting useful work done.
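One possible variant (untested here): pool.map first pulls the whole input iterable into a list before splitting it into chunks, so if memory were a concern, pool.imap with the same chunksize should stream the results back in order while still reading the file line by line:

with open('tweets.txt') as fin, open('tweets2.txt', 'w') as fout:
    with Pool() as pool:
        # imap keeps the same chunking but yields results lazily and in order,
        # instead of building one big list of all the replaced tweets
        for l in pool.imap(replace_in, fin, chunksize=1024):
            fout.write(l)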
So I would guess that joblib just isn't particularly useful for this kind of usage, since it doesn't seem to have a notion of a chunksize.