Search keywords efficiently when keywords are multi-word

I need to efficiently match a very large list of keywords (>1,000,000) against a string using Python. I have found some very good libraries that try to do this fast:

1) FlashText (https://github.com/vi3k6i5/flashtext)

2) The Aho-Corasick algorithm, etc.

But I have a special requirement: in my context, if my string is "XXXX is a good indication of YYYY", then a keyword such as 'XXXX YYYY' should return a match. Note that 'XXXX YYYY' does not occur as a substring, but XXXX and YYYY both appear in the string, and for me that is enough to count as a match.

I know how to do this naively. What I am after is efficiency. Is there a better library?
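
For reference, this is roughly how the libraries above are used. A minimal sketch with FlashText's KeywordProcessor; it only fires when the multi-word keyword occurs contiguously, which is exactly the limitation described above:

from flashtext import KeywordProcessor

kp = KeywordProcessor()
kp.add_keyword('XXXX YYYY')

# No hit: the two words are separated by other text
print(kp.extract_keywords('XXXX is a good indication of YYYY'))   # []

# Hit only when the phrase occurs as a contiguous substring
print(kp.extract_keywords('here XXXX YYYY occurs together'))      # ['XXXX YYYY']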

这属于 "naive" 阵营,但这里有一个使用集合作为思想食粮的方法:

docs = [
    """ Here's a sentence with dog and apple in it """,
    """ Here's a sentence with dog and poodle in it """,
    """ Here's a sentence with poodle and apple in it """,
    """ Here's a dog with and apple and a poodle in it """,
    """ Here's an apple with a dog to show that order is irrelevant """
]

query = ['dog', 'apple']

def get_similar(query, docs):
    res = []
    query_set = set(query)
    for i in docs:
        # keep i if every element of query appears in i
        if query_set & set(i.split(" ")) == query_set:
            res.append(i)
    return res

This returns:

[" Here's a sentence with dog and apple in it ", 
" Here's a dog with and apple and a poodle in it ", 
" Here's an apple with a dog to show that order is irrelevant "]

The time complexity is nothing special, of course, but because hash/set operations are so fast, it is still much quicker overall than doing the same with lists.
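
If the same corpus is queried repeatedly, an obvious refinement of the snippet above is to build each document's word set once up front. A minimal sketch (the helper name is my own):

# Build each document's word set once, so every query only pays for the subset check
doc_sets = [set(d.split(" ")) for d in docs]

def get_similar_precomputed(query, docs, doc_sets):
    # keep a document if its word set contains every query word
    query_set = set(query)
    return [d for d, s in zip(docs, doc_sets) if query_set <= s]

print(get_similar_precomputed(['dog', 'apple'], docs, doc_sets))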


Part 2 is Elasticsearch: if you are willing to put in the effort and have a lot of data to deal with, it is a very good choice.
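
For illustration, the "all words must appear, in any order" requirement maps onto a match query with operator "and". A rough sketch using the official Python client; the local cluster URL, the index name "docs", and the 8.x client API are my assumptions:

from elasticsearch import Elasticsearch

# Assumes a local Elasticsearch instance and the 8.x Python client
es = Elasticsearch("http://localhost:9200")

# Index the example sentences from the snippet above
for i, t in enumerate(docs):
    es.index(index="docs", id=i, document={"text": t})
es.indices.refresh(index="docs")

# A match query with operator "and" requires every term, in any order
resp = es.search(
    index="docs",
    query={"match": {"text": {"query": "dog apple", "operator": "and"}}},
)
for hit in resp["hits"]["hits"]:
    print(hit["_source"]["text"])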

What you are asking about sounds like a full-text search task. There is a Python search package called Whoosh. @derek's corpus can be indexed and searched in memory like this:

from whoosh.filedb.filestore import RamStorage
from whoosh.qparser import QueryParser
from whoosh import fields


texts = [
    "Here's a sentence with dog and apple in it",
    "Here's a sentence with dog and poodle in it",
    "Here's a sentence with poodle and apple in it",
    "Here's a dog with and apple and a poodle in it",
    "Here's an apple with a dog to show that order is irrelevant"
]

# Define a schema with a single stored text field and build an in-memory index
schema = fields.Schema(text=fields.TEXT(stored=True))
storage = RamStorage()
index = storage.create_index(schema)
storage.open_index()

# Add every document to the index
writer = index.writer()
for t in texts:
    writer.add_document(text=t)
writer.commit()

# The default QueryParser ANDs the terms, so both words must appear, in any order
query = QueryParser('text', schema).parse('dog apple')
results = index.searcher().search(query)

for r in results:
    print(r)

This produces:

<Hit {'text': "Here's a sentence with dog and apple in it"}>
<Hit {'text': "Here's a dog with and apple and a poodle in it"}>
<Hit {'text': "Here's an apple with a dog to show that order is irrelevant"}>

You can also persist your index using a FileStorage, as described in How to index documents.
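
A minimal sketch of the on-disk variant, using create_in/open_dir from whoosh.index; the directory name is my own choice:

import os
from whoosh import index, fields

# Persist the index to a directory instead of RAM (the directory name is arbitrary)
schema = fields.Schema(text=fields.TEXT(stored=True))
if not os.path.exists("indexdir"):
    os.mkdir("indexdir")
ix = index.create_in("indexdir", schema)

writer = ix.writer()
writer.add_document(text="Here's a sentence with dog and apple in it")
writer.commit()

# Later, or from another process, reopen the same index from disk
ix = index.open_dir("indexdir")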