Searching keywords efficiently when keywords are multi-word
I need to efficiently match a very large list of keywords (>1,000,000) against a string in Python. I found some very good libraries that try to do this fast:
1) FlashText (https://github.com/vi3k6i5/flashtext)
2) the Aho-Corasick algorithm, etc.
However, I have a particular requirement: in my context, if my string is "XXXX is a good indicator of YYYY", then the keyword 'XXXX YYYY' should return a match. Note that 'XXXX YYYY' does not occur as a substring, but XXXX and YYYY both appear in the string, and that is enough of a match for me.
I know how to do this naively. What I'm looking for is efficiency. Is there a better library?
This falls into the "naive" camp, but here's an approach using sets, as food for thought:
docs = [
""" Here's a sentence with dog and apple in it """,
""" Here's a sentence with dog and poodle in it """,
""" Here's a sentence with poodle and apple in it """,
""" Here's a dog with and apple and a poodle in it """,
""" Here's an apple with a dog to show that order is irrelevant """
]
query = ['dog', 'apple']
def get_similar(query, docs):
    res = []
    query_set = set(query)
    for i in docs:
        # if all n elements of query are in i, return i
        if query_set & set(i.split(" ")) == query_set:
            res.append(i)
    return res
This returns:
[" Here's a sentence with dog and apple in it ",
" Here's a dog with and apple and a poodle in it ",
" Here's an apple with a dog to show that order is irrelevant "]
The time complexity isn't great, of course, but thanks to the speed of hash/set operations it's much faster overall than using lists.
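If the corpus is large, the set approach above can be extended with a pre-built inverted index that maps each token to the set of document ids containing it, so each query becomes a few set intersections instead of a scan over every document. A minimal sketch (the names `build_index` and `search` are illustrative, not from any library):

```python
from collections import defaultdict

def build_index(docs):
    # Map each token to the set of document ids that contain it.
    index = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        for token in doc.split():
            index[token].add(doc_id)
    return index

def search(index, docs, query):
    # Intersect the posting sets of all query tokens;
    # a document matches only if it contains every token.
    ids = set.intersection(*(index.get(tok, set()) for tok in query))
    return [docs[i] for i in sorted(ids)]

docs = [
    "Here's a sentence with dog and apple in it",
    "Here's a sentence with dog and poodle in it",
    "Here's an apple with a dog to show that order is irrelevant",
]
index = build_index(docs)
print(search(index, docs, ["dog", "apple"]))  # first and third docs match
```

The index is built once; after that each lookup only touches the posting sets of the query's own tokens, which is essentially what full-text engines do under the hood.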
Part 2 is Elasticsearch. If you're willing to put in the effort and are dealing with a lot of data, Elasticsearch is a great option.
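For reference, the matching behaviour asked for here (all terms required, order and adjacency irrelevant) corresponds to a `match` query with `operator` set to `and` in Elasticsearch's query DSL. A sketch of such a request body, assuming the text is indexed in a field named `text`:

```json
{
  "query": {
    "match": {
      "text": {
        "query": "dog apple",
        "operator": "and"
      }
    }
  }
}
```

With the default `operator` of `or`, a document containing only one of the terms would also be returned.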
What you're asking for sounds like a full-text search task. There's a Python search package called Whoosh. @derek's corpus can be indexed and searched in memory like this:
from whoosh.filedb.filestore import RamStorage
from whoosh.qparser import QueryParser
from whoosh import fields
texts = [
"Here's a sentence with dog and apple in it",
"Here's a sentence with dog and poodle in it",
"Here's a sentence with poodle and apple in it",
"Here's a dog with and apple and a poodle in it",
"Here's an apple with a dog to show that order is irrelevant"
]
schema = fields.Schema(text=fields.TEXT(stored=True))
storage = RamStorage()
index = storage.create_index(schema)

writer = index.writer()
for t in texts:
    writer.add_document(text=t)
writer.commit()

query = QueryParser('text', schema).parse('dog apple')
results = index.searcher().search(query)
for r in results:
    print(r)
This produces:
<Hit {'text': "Here's a sentence with dog and apple in it"}>
<Hit {'text': "Here's a dog with and apple and a poodle in it"}>
<Hit {'text': "Here's an apple with a dog to show that order is irrelevant"}>
You can also use a FileStorage to persist your index, as described in How to index documents.