Fastest way to match substring from large dict

Question

I have some strings (generally < 300 symbols long), e.g. 'aabbccdcabcbbacdaaa'.

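There is a Python dictionary whose keys are strings of a similar format, e.g. 'bcccd'. Key length varies from 10 to 100 symbols. The dictionary has half a million items.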

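I need to match my initial string to a value in the dictionary, or determine that the dictionary has no suitable value. Matching condition: the dictionary key must occur somewhere within the string (strict match).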

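What is the best way to do this in terms of computation speed? I feel there should be some clever way to hash my initial string and the dictionary keys so that a smart substring search method (like Rabin-Karp or Knuth-Morris-Pratt) can be applied. Or might a suffix-tree-like structure be a good solution?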

Answer 1

You can use the following pattern:

for key in your_dictionary:
    if key in your_string:
        print(key+' is in both your string and the dictionary. It has the value '+str(your_dictionary[key]))

If you would like this changed in any way, let me know in the comments and I will happily update it.

Answer 2

def search(string, dict_search):
    # If those 2 lines are too expensive, calculate them and pass as arguments
    max_key = max(len(x) for x in dict_search)
    min_key = min(len(x) for x in dict_search)

    return set(
        string[x:x+i] 
        for i in range(min_key, max_key+1)
        for x in range(len(string)-i+1)
        if string[x:x+i] in dict_search
    )

Running it:

>>> search('aabbccdcabcbbacdaaa', {'aaa', 'acd', 'adb', 'bccd', 'cbbb', 'abc'})
{'aaa', 'abc', 'acd', 'bccd'}
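
A possible refinement, sketched under the assumption that the keys take only a limited number of distinct lengths: collect those lengths once and probe only the window sizes that can actually match. The name `search_by_lengths` is made up for this illustration and is not part of the original answer.

```python
def search_by_lengths(string, dict_search):
    # Distinct key lengths; precompute this once if the dictionary is reused.
    lengths = {len(key) for key in dict_search}
    found = set()
    for size in lengths:
        # Slide a window of this size over the string.
        for start in range(len(string) - size + 1):
            candidate = string[start:start + size]
            # Hash lookup: O(1) on average for a dict or a set.
            if candidate in dict_search:
                found.add(candidate)
    return found
```

With keys of 10 to 100 symbols this can still mean up to 91 passes over the string, so the gain depends on how sparse the set of lengths really is.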

Answer 3

Just found a reasonable implementation of Aho-Corasick for Python: pyahocorasick. Taken from the example at the end of the page:

import ahocorasick
A = ahocorasick.Automaton()

# Register each key, with its dictionary value as the payload.
for k, v in your_big_dict.items():
    A.add_word(k, v)

A.make_automaton()
for item in A.iter(your_long_string):
    print(item)
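
To illustrate what such an automaton does internally, here is a minimal pure-Python sketch of the Aho-Corasick idea: a trie of all keys plus failure links, so the string is scanned once regardless of how many keys there are. This is not pyahocorasick's implementation and will be far slower than the C extension; the names `build_automaton` and `find_all` are invented for this example.

```python
from collections import deque


def build_automaton(patterns):
    goto = [{}]   # goto[s]: char -> next state; state 0 is the root
    fail = [0]    # fail[s]: state to fall back to on a mismatch
    out = [[]]    # out[s]: patterns that end when state s is reached
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pattern)
    # Breadth-first pass: a node's failure link points at the longest
    # proper suffix of its path that is also a path in the trie.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit matches that end at the fallback state.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, automaton):
    goto, fail, out = automaton
    state = 0
    hits = []
    for end, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in out[state]:
            hits.append((end - len(pattern) + 1, pattern))
    return hits
```

Passing the dictionary itself to `build_automaton` iterates over its keys; each hit is a `(start_index, key)` pair, so the value can then be looked up with `your_big_dict[key]`.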

