Search keywords efficiently when keywords are multi-word

I need to efficiently match a very large list of keywords (>1,000,000) against a string using Python. I have found some very good libraries that try to do this fast:

1) FlashText (https://github.com/vi3k6i5/flashtext)

2) The Aho-Corasick algorithm, etc.

But I have a special requirement: in my context, if my string is "XXXX is a good indication of YYYY", then a keyword such as 'XXXX YYYY' should return a match. Note that 'XXXX YYYY' does not occur as a substring, but XXXX and YYYY both appear in the string, and for me that is enough to count as a match.

I know how to do this naively. What I am after is efficiency. Is there a better library?
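
For reference, this is roughly how the libraries above are used. A minimal sketch with FlashText's KeywordProcessor; it only fires when the multi-word keyword occurs contiguously, which is exactly the limitation described above:

from flashtext import KeywordProcessor

kp = KeywordProcessor()
kp.add_keyword('XXXX YYYY')

# No hit: the two words are separated by other text
print(kp.extract_keywords('XXXX is a good indication of YYYY'))   # []

# Hit only when the phrase occurs as a contiguous substring
print(kp.extract_keywords('here XXXX YYYY occurs together'))      # ['XXXX YYYY']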

这属于 "naive" 阵营,但这里有一个使用集合作为思想食粮的方法:

docs = [
    """ Here's a sentence with dog and apple in it """,
    """ Here's a sentence with dog and poodle in it """,
    """ Here's a sentence with poodle and apple in it """,
    """ Here's a dog with and apple and a poodle in it """,
    """ Here's an apple with a dog to show that order is irrelevant """
]

query = ['dog', 'apple']

def get_similar(query, docs):
    res = []
    query_set = set(query)
    for i in docs:
        # keep i if every element of query appears in i
        if query_set & set(i.split(" ")) == query_set:
            res.append(i)
    return res

This returns:

[" Here's a sentence with dog and apple in it ", 
" Here's a dog with and apple and a poodle in it ", 
" Here's an apple with a dog to show that order is irrelevant "]

The time complexity is nothing special, of course, but because hash/set operations are so fast, it is still much quicker overall than doing the same with lists.
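
If the same corpus is queried repeatedly, an obvious refinement of the snippet above is to build each document's word set once up front. A minimal sketch (the helper name is my own):

# Build each document's word set once, so every query only pays for the subset check
doc_sets = [set(d.split(" ")) for d in docs]

def get_similar_precomputed(query, docs, doc_sets):
    # keep a document if its word set contains every query word
    query_set = set(query)
    return [d for d, s in zip(docs, doc_sets) if query_set <= s]

print(get_similar_precomputed(['dog', 'apple'], docs, doc_sets))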


Part 2 is Elasticsearch: if you are willing to put in the effort and have a lot of data to deal with, it is a very good choice.
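
For illustration, the "all words must appear, in any order" requirement maps onto a match query with operator "and". A rough sketch using the official Python client; the local cluster URL, the index name "docs", and the 8.x client API are my assumptions:

from elasticsearch import Elasticsearch

# Assumes a local Elasticsearch instance and the 8.x Python client
es = Elasticsearch("http://localhost:9200")

# Index the example sentences from the snippet above
for i, t in enumerate(docs):
    es.index(index="docs", id=i, document={"text": t})
es.indices.refresh(index="docs")

# A match query with operator "and" requires every term, in any order
resp = es.search(
    index="docs",
    query={"match": {"text": {"query": "dog apple", "operator": "and"}}},
)
for hit in resp["hits"]["hits"]:
    print(hit["_source"]["text"])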

What you are asking about sounds like a full-text search task. There is a Python search package called Whoosh. @derek's corpus can be indexed and searched in memory like this:

from whoosh.filedb.filestore import RamStorage
from whoosh.qparser import QueryParser
from whoosh import fields


texts = [
    "Here's a sentence with dog and apple in it",
    "Here's a sentence with dog and poodle in it",
    "Here's a sentence with poodle and apple in it",
    "Here's a dog with and apple and a poodle in it",
    "Here's an apple with a dog to show that order is irrelevant"
]

# Define a schema with a single stored text field and build an in-memory index
schema = fields.Schema(text=fields.TEXT(stored=True))
storage = RamStorage()
index = storage.create_index(schema)
storage.open_index()

# Add every document to the index
writer = index.writer()
for t in texts:
    writer.add_document(text=t)
writer.commit()

# The default QueryParser ANDs the terms, so both words must appear, in any order
query = QueryParser('text', schema).parse('dog apple')
results = index.searcher().search(query)

for r in results:
    print(r)

This produces:

<Hit {'text': "Here's a sentence with dog and apple in it"}>
<Hit {'text': "Here's a dog with and apple and a poodle in it"}>
<Hit {'text': "Here's an apple with a dog to show that order is irrelevant"}>

You can also persist your index using a FileStorage, as described in How to index documents.
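
A minimal sketch of the on-disk variant, using create_in/open_dir from whoosh.index; the directory name is my own choice:

import os
from whoosh import index, fields

# Persist the index to a directory instead of RAM (the directory name is arbitrary)
schema = fields.Schema(text=fields.TEXT(stored=True))
if not os.path.exists("indexdir"):
    os.mkdir("indexdir")
ix = index.create_in("indexdir", schema)

writer = ix.writer()
writer.add_document(text="Here's a sentence with dog and apple in it")
writer.commit()

# Later, or from another process, reopen the same index from disk
ix = index.open_dir("indexdir")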