使用复杂的 if 语句在 Python 中进行更快的文本搜索

Question

我有大量长带有标点符号的文本文档。这里提供了三个简短的例子：

doc = ["My house, the most beautiful!, is NEAR the #seaside. I really love holidays, do you?", "My house, the most beautiful!, is NEAR the #seaside. I really love holidays, do you love dogs?", "My house, the most beautiful!, is NEAR the #sea. I really love holidays, do you?"]

我有如下几组词：

wAND = set(["house", "near"])
wOR = set(["seaside"])
wNOT = set(["dogs"])

我要搜索所有满足以下条件的文本文档：

(any(w in doc for w in wOR) or not wOR) and (all(w in doc for w in wAND) or not wAND) and (not any(w in doc for w in wNOT) or not wNOT)

需要每个括号中的 or not 条件，因为三个列表可能为空。请注意，在应用条件之前，我还需要从标点符号中清除文本，将其转换为小写，并将其拆分为一组单词，这需要额外的时间。

此过程将匹配 doc 中的第一个文本，但不匹配第二个和第三个。实际上，第二个不匹配，因为它包含单词“dogs”，而第三个不匹配，因为它不包含单词“seaside”。

我想知道是否可以更快地解决这个一般性问题（wOR、wAND 和 wNOT 列表中的词发生变化），从而避免文本预处理以进行清理。也许使用快速 regex 解决方案，可能使用 Trie()。那可能吗？或任何其他建议？

Answer 1

您的解决方案似乎与文档的长度呈线性关系 - 如果不进行排序，您将无法获得比这更好的解决方案，因为您要查找的词可能在文档中的任何位置。您可以尝试对整个文档使用一个循环：

or_satisfied = False
for w in doc:
    if word in wAND: wAND.remove(word)
    if not or_satisfied and word in wOR: or_satisfied = True
    if word in wNOT: return False
return or_satisfied and not wAND

Answer 2

您可以为您拥有的词袋构建正则表达式，并使用它们：

def make_re(word_set):
    return re.compile(
        r'\b(?:{})\b'.format('|'.join(re.escape(word) for word in word_set)),
        flags=re.I,
    )


wAND_re = make_re(wAND)
wOR_re = make_re(wOR)
wNOT_re = make_re(wNOT)

def re_match(doc):
    if not wOR_re.search(doc):
        return False
    if wNOT_re.search(doc):
        return False
    found = set()
    expected = len(wAND)
    for word in re.finditer(r'\w+', doc):
        found.add(word)
        if len(found) == expected:
            break
    return len(found) == expected

快速时间测试似乎表明这比原来的速度快 89%（并通过了原来的“测试套件”），~~很可能~~显然是因为

文档不需要清理（因为 \bs 限制匹配单词，re.I 处理大小写规范化）
正则表达式在本机代码中是运行，它往往总是比 Python

name='original'      iters=10000 time=0.206 iters_per_sec=48488.39
name='re_match'      iters=20000 time=0.218 iters_per_sec=91858.73
name='bag_match'     iters=10000 time=0.203 iters_per_sec=49363.58

其中 bag_match 是我对使用集合交集的原始评论建议：

def bag_match(doc):
    bag = set(clean_doc(doc))
    return (
        (bag.intersection(wOR) or not wOR) and
        (bag.issuperset(wAND) or not wAND) and
        (not bag.intersection(wNOT) or not wNOT)
    )

如果您已经将文档清理成可迭代的单词（这里我只是在 clean_doc 上打了 @lru_cache，这在现实生活中您可能不会这样做，因为您的文档可能都是唯一的，缓存也无济于事），那么 bag_match 会快得多：

name='orig-with-cached-clean-doc' iters=50000 time=0.249 iters_per_sec=200994.97
name='re_match-with-cached-clean-doc' iters=20000 time=0.221 iters_per_sec=90628.94
name='bag_match-with-cached-clean-doc' iters=100000 time=0.265 iters_per_sec=377983.60

使用复杂的 if 语句在 Python 中进行更快的文本搜索

Faster text search in Python with complex if statement

python

string

list