Regex replace is taking time for millions of documents, how to make it faster?

Question

I have documents like this:

documents = [
    "I work on c programing.",
    "I work on c coding.",
]

I have a synonyms dictionary like this:

synonyms = {
    "c programing": "c programing",
    "c coding": "c programing"
}

I want to replace all synonyms, so I wrote this code:

import re

# Added code to pre-compile all regexes to save compilation time. Credits: alec_djinn

compiled_dict = {}
for value in synonyms:
    compiled_dict[value] = re.compile(r'\b' + re.escape(value) + r'\b')

for doc in documents:
    document = doc
    for value in compiled_dict:
        lowercase = compiled_dict[value]
        document = lowercase.sub(synonyms[value], document)
    print(document)

Output:

I work on c programing.
I work on c programing.

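But since the number of documents is in the millions and the number of synonyms is in the tens of thousands, the estimated time for this code to finish is about 10 days.

Is there a faster way to do this?

PS: I want to train a word2vec model on the output.

Any help is much appreciated. I am thinking of writing some CPython code and putting it in parallel threads.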

Answer 1

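Steps I would take:

Create a direct algorithm without regex. Maybe even generate code directly from the synonyms.
Partition the work so that the algorithm runs on N/x documents at a time, where the split is chosen to make full use of your parallel resources (e.g. x = 4 if you have 4 cores), and run it with a parallel approach (note: avoid threads).
If you have resources on multiple nodes (e.g. using Spark), maybe use a library that helps you run this in parallel.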
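
The first two steps could be sketched roughly as follows. This is only an illustration under stated assumptions: `replace_synonyms` and `replace_all` are hypothetical names, the matcher is a simple greedy longest-match over whitespace-separated words (it collapses runs of whitespace and only tolerates a few trailing punctuation characters), and the worker count and chunk size are placeholders to tune.

```python
from multiprocessing import Pool

SYNONYMS = {
    "c programing": "c programing",
    "c coding": "c programing",
}
# Longest key measured in words, so the matcher knows how far to look ahead.
MAX_WORDS = max(len(key.split()) for key in SYNONYMS)
TRAILING = ".,;:!?"  # assumption: punctuation allowed to trail a match


def replace_synonyms(document):
    """Regex-free replacement: greedy longest match over word n-grams."""
    words = document.split()  # note: collapses repeated whitespace
    out = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(MAX_WORDS, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n])
            core = phrase.rstrip(TRAILING)
            if core in SYNONYMS:
                # Keep whatever punctuation followed the matched phrase.
                out.append(SYNONYMS[core] + phrase[len(core):])
                i += n
                break
        else:
            # No synonym starts at this word; copy it through unchanged.
            out.append(words[i])
            i += 1
    return " ".join(out)


def replace_all(documents, processes=4, chunksize=1000):
    """Spread the documents over worker processes (not threads)."""
    with Pool(processes=processes) as pool:
        return pool.map(replace_synonyms, documents, chunksize=chunksize)
```

`replace_all(documents)` should be called from under an `if __name__ == "__main__":` guard so that it also works on platforms that spawn rather than fork worker processes. Processes are used instead of threads because this work is CPU-bound pure Python, where threads would contend for the interpreter lock.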

Answer 2

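I would pre-compile all the regex strings and put them into a dictionary. This way you avoid compiling the same values over and over again, which will save a lot of time.

Your main loop would become: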

import re

# Compile each pattern once, up front
compiled_dict = {}
for value in synonyms:
    compiled_dict[value] = re.compile(r'\b' + re.escape(value) + r'\b')

for document in documents:
    for value in synonyms:
        pattern = compiled_dict[value]
        document = pattern.sub(synonyms[value], document)
    print(document)
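
A variation on this idea, not part of the answer above, is to fold all keys into one alternation so that each document is scanned once instead of once per synonym. The sketch below assumes that leftmost, longest-key-first matching is acceptable; whether it is fast enough with tens of thousands of alternatives is something to measure rather than assume.

```python
import re

synonyms = {
    "c programing": "c programing",
    "c coding": "c programing",
}

# Longest keys first, so a longer phrase wins over a shorter key that is its prefix.
keys = sorted(synonyms, key=len, reverse=True)
combined = re.compile(r'\b(?:' + '|'.join(re.escape(k) for k in keys) + r')\b')


def replace_synonyms(document):
    # One pass over the document; the callback looks up each matched key.
    return combined.sub(lambda m: synonyms[m.group(0)], document)


documents = ["I work on c programing.", "I work on c coding."]
replaced = [replace_synonyms(doc) for doc in documents]
```

Using a callback rather than a replacement string also means backslashes in the replacement values are never interpreted as group references.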

Answer 3

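I have done string replacement jobs like this before, also for training word2vec models on very large text corpora. When the number of terms to replace (your "synonym terms") is very large, it can make sense to do the replacement with the Aho-Corasick algorithm instead of looping over many single string replacements. You can take a look at my fsed utility (written in Python), which might be useful to you.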
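
To make the idea concrete, here is a compact pure-Python sketch of Aho-Corasick based replacement. It is not the fsed implementation; the function names, the word-boundary rule (no alphanumeric or underscore character directly before or after a match) and the leftmost-longest conflict resolution are all assumptions made for this illustration.

```python
from collections import deque


def build_automaton(keys):
    """Build trie transitions, failure links and output links for non-empty keys."""
    goto, fail, out, link = [{}], [0], [None], [0]
    for key in keys:
        state = 0
        for ch in key:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(None)
                link.append(0)
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state] = key
    # Breadth-first pass: shallower states are finished before deeper ones.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            target = fail[nxt]
            # link points at the nearest suffix state that ends a key (0 if none).
            link[nxt] = target if out[target] is not None else link[target]
    return goto, fail, out, link


def find_matches(text, automaton):
    """Yield (start, end, key) for every key occurrence, in a single scan."""
    goto, fail, out, link = automaton
    state = 0
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        s = state if out[state] is not None else link[state]
        while s:
            yield i - len(out[s]) + 1, i + 1, out[s]
            s = link[s]


def is_word_char(ch):
    return ch.isalnum() or ch == "_"


def replace_synonyms(text, synonyms, automaton):
    """Replace whole-word key occurrences, preferring leftmost then longest."""
    candidates = []
    for start, end, key in find_matches(text, automaton):
        left_ok = start == 0 or not is_word_char(text[start - 1])
        right_ok = end == len(text) or not is_word_char(text[end])
        if left_ok and right_ok:
            candidates.append((start, end, key))
    candidates.sort(key=lambda m: (m[0], m[0] - m[1]))
    pieces, pos = [], 0
    for start, end, key in candidates:
        if start < pos:
            continue  # overlaps a replacement that was already made
        pieces.append(text[pos:start])
        pieces.append(synonyms[key])
        pos = end
    pieces.append(text[pos:])
    return "".join(pieces)


synonyms = {"c programing": "c programing", "c coding": "c programing"}
automaton = build_automaton(synonyms)  # built once, reused for every document
result = replace_synonyms("I work on c coding.", synonyms, automaton)
```

The automaton is built once from all keys, so the scan over each document does not get slower per synonym the way a loop of separate substitutions does.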

Answer 4

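First, the code you presented will not find c++ in "c++.", because \b does not match between the final + and the terminating dot (both are non-word characters). You can use (?!\w) instead of \b.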
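
A small check of that behaviour, using a made-up sentence and `c++` as a hypothetical synonym key:

```python
import re

text = "I work on c++."
key = "c++"

# \b needs a word character on exactly one side; between "+" and "." there is none.
with_b = re.compile(r'\b' + re.escape(key) + r'\b')
# (?!\w) only requires that no word character follows the key.
with_lookahead = re.compile(r'\b' + re.escape(key) + r'(?!\w)')

found_b = with_b.search(text)                  # no match
found_lookahead = with_lookahead.search(text)  # matches "c++"
```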

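Assuming that most synonyms are single words (no spaces or special characters), you can optimise by looking at each actual word in the document and replacing it when it occurs in the synonym list.

The remaining synonym keys (hopefully a minority) can then be handled the way you did, but after first turning the keys into their regex equivalents.

Here is how that looks: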

import re

# Get those synonyms that are not single words and turn them into compiled regexes.
# Don't use \b to end a pattern; just require that no \w should follow.
complex_synonyms = [
    (re.compile(r'\b' + re.escape(key) + r'(?!\w)'), value)
    for key, value in synonyms.items()
    if not re.fullmatch(r'[\w+]+', key)
]

for i, document in enumerate(documents):
    # Deal with the easy cases (words) in one go, by checking each word in the document
    document = re.sub(r'[\w+]+', lambda word: synonyms.get(word[0], word[0]), document)
    # Replace the remaining synonyms by using the pre-compiled regular expressions
    for pattern, repl in complex_synonyms:
        document = pattern.sub(repl, document)
    # Store the result back into the document
    documents[i] = document

python

parallel-processing

cpython

word2vec