Efficiently build a graph of words with given Hamming distance

I want to build a graph from a list of words with a Hamming distance of (say) 1, or to put it differently, two words are connected if they differ from each other by only one letter (lol -> lot).

So given

words = [ lol, lot, bot ]

the graph would be

{
  'lol' : [ 'lot' ],
  'lot' : [ 'lol', 'bot' ],
  'bot' : [ 'lot' ]
}

The easy way is to compare every word in the list with every other one and count the differing characters; sadly, that is an O(N^2) algorithm.

Which algorithm / data structure / strategy can I use to get better performance?

Also, let's assume only Latin characters, and that all the words have the same length.

Assuming you store your dictionary in a set(), so that lookup is O(1) on average (worst case O(n)).

You can generate all the valid words at Hamming distance 1 from a word:

>>> import string
>>> def neighbours(word):
...     for j in range(len(word)):
...         for d in string.ascii_lowercase:
...             word1 = ''.join(d if i==j else c for i,c in enumerate(word))
...             if word1 != word and word1 in words: yield word1
...
>>> {word: list(neighbours(word)) for word in words}
{'bot': ['lot'], 'lol': ['lot'], 'lot': ['bot', 'lol']}

If M is the length of a word and L is the size of the alphabet (i.e. 26), the worst-case time complexity of finding neighbouring words with this approach is O(L*M*N): for each of the N words you try L characters at each of the M positions.

"easy way" 方法的时间复杂度是 O(N^2)

When is this approach better? When L*M < N, i.e. if we consider only lowercase letters, when M < N/26. (I considered only the worst case here.)

Note: the average length of an English word is 5.1 letters, and 26 * 5.1 ≈ 132.6. Thus, if your dictionary size is bigger than 132 words, you should consider this approach.

It's probably possible to achieve better performance than this, but this one is really simple to implement.

Experimental benchmark:

"easy way"算法(A1):

from itertools import zip_longest
def hammingdist(w1,w2): return sum(1 if c1!=c2 else 0 for c1,c2 in zip_longest(w1,w2))
def graph1(words): return {word: [n for n in words if hammingdist(word,n) == 1] for word in words}

This algorithm (A2):

def graph2(words): return {word: list(neighbours(word)) for word in words}

Benchmark code:

import random
import string
from timeit import Timer

for dict_size in range(100, 6000, 100):
    # random dictionary of 3-letter words
    words = set(''.join(random.choice(string.ascii_lowercase) for _ in range(3)) for _ in range(dict_size))
    t1 = Timer(lambda: graph1(words)).timeit(10)
    t2 = Timer(lambda: graph2(words)).timeit(10)
    print('%d,%f,%f' % (dict_size, t1, t2))

Output:

100,0.119276,0.136940
200,0.459325,0.233766
300,0.958735,0.325848
400,1.706914,0.446965
500,2.744136,0.545569
600,3.748029,0.682245
700,5.443656,0.773449
800,6.773326,0.874296
900,8.535195,0.996929
1000,10.445875,1.126241
1100,12.510936,1.179570
...

I ran another benchmark with smaller steps of N to see it closer:

10,0.002243,0.026343
20,0.010982,0.070572
30,0.023949,0.073169
40,0.035697,0.090908
50,0.057658,0.114725
60,0.079863,0.135462
70,0.107428,0.159410
80,0.142211,0.176512
90,0.182526,0.210243
100,0.217721,0.218544
110,0.268710,0.256711
120,0.334201,0.268040
130,0.383052,0.291999
140,0.427078,0.312975
150,0.501833,0.338531
160,0.637434,0.355136
170,0.635296,0.369626
180,0.698631,0.400146
190,0.904568,0.444710
200,1.024610,0.486549
210,1.008412,0.459280
220,1.056356,0.501408
...

You can see that the trade-off point is very low (100 for a dictionary of words of length 3). For small dictionaries the O(N^2) algorithm performs slightly better, but as N grows the O(LMN) algorithm easily beats it.

For dictionaries of longer words, the O(LMN) algorithm remains linear in N, it just has a different slope, so the trade-off moves slightly to the right (130 for length 5).

Here is a linear, O(N), algorithm, but with a large constant factor (R * L * 2). R is the radix (26 for the Latin alphabet). L is the median word length. 2 is the factor for the adding/replacing wildcard operations: abc -> aac (replace) and abc -> abca (add) are the two operations that lead to a Hamming distance of 1.

It is written in Ruby. For 240k words it takes ~250MB of RAM and 136 seconds on average hardware.

Graph implementation blueprint:

class Node
  attr_reader :val, :edges

  def initialize(val)
    @val = val
    @edges = {}
  end

  def <<(node)
    @edges[node.val] ||= true
  end

  def connected?(node)
    @edges[node.val]
  end

  def inspect
    "Val: #{@val}, edges: #{@edges.keys * ', '}"
  end
end

class Graph
  attr_reader :vertices
  def initialize
    @vertices = {}
  end

  def <<(val)
    @vertices[val] = Node.new(val)
  end

  def connect(node1, node2)
    # print "connecting #{size} #{node1.val}, #{node2.val}\r"
    node1 << node2
    node2 << node1
  end

  def each
    @vertices.each do |val, node|
      yield [val, node]
    end
  end

  def get(val)
    @vertices[val]
  end
end

The algorithm itself:

CHARACTERS = ('a'..'z').to_a
graph = Graph.new

# ~ 240 000 words
File.read("/usr/share/dict/words").each_line do |word|
  graph << word.chomp.downcase
end

graph.each do |val, node|
  CHARACTERS.each do |char|
    i = 0
    while i <= val.size
      # insert char before position i (the "adding" wildcard operation)
      node2 = graph.get(val[0, i] + char + val[i..-1])
      graph.connect(node, node2) if node2
      if i < val.size
        # replace the character at position i (the "replacing" wildcard operation);
        # skip the node itself so replacing a character with itself adds no self-loop
        node2 = graph.get(val[0, i] + char + val[i+1..-1])
        graph.connect(node, node2) if node2 && node2 != node
      end
      i += 1
    end
  end
end
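
For comparison with the Python answers, here is how the same probe-the-dictionary idea might look in Python. This is only a minimal sketch under the question's assumptions (lowercase Latin words); build_graph is my name for it, not part of the answer:

import string

def build_graph(words):
    # Sketch of the Ruby algorithm above: for every word, probe the word set
    # with every single-character insertion and replacement, and connect hits.
    vertices = set(words)
    graph = {w: set() for w in vertices}
    for w in vertices:
        for i in range(len(w) + 1):
            for c in string.ascii_lowercase:
                added = w[:i] + c + w[i:]            # the "adding" operation
                if added in graph:
                    graph[w].add(added)
                    graph[added].add(w)
                if i < len(w):
                    replaced = w[:i] + c + w[i+1:]   # the "replacing" operation
                    if replaced != w and replaced in graph:
                        graph[w].add(replaced)
                        graph[replaced].add(w)
    return graph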

There is no need to depend on the alphabet size. Given a word bot, for example, insert it into a dictionary of word lists under the keys ?ot, b?t, bo?. Then, for each word list, connect all of its pairs.

import collections

d = collections.defaultdict(list)
with open('/usr/share/dict/words') as f:
    for line in f:
        for word in line.split():
            if len(word) == 6:
                # bucket the word under each key with one letter blanked out
                for i in range(len(word)):
                    d[word[:i] + ' ' + word[i + 1:]].append(word)
# words sharing a bucket differ in exactly the blanked position
pairs = [(word1, word2) for s in d.values() for word1 in s for word2 in s if word1 < word2]
print(len(pairs))
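
If you want the adjacency-list graph from the question rather than just the pair count, the buckets translate directly. A minimal follow-up sketch, continuing from the snippet above (the graph name is mine):

graph = collections.defaultdict(list)
for bucket in d.values():
    # two distinct words share exactly one bucket when their Hamming
    # distance is 1, so no pair is added twice
    for word1 in bucket:
        for word2 in bucket:
            if word1 != word2:
                graph[word1].append(word2)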

A Ternary Search Trie supports near-neighbour searching quite well.

If your dictionary is stored in a TST then, I believe, the average complexity of lookups while building your graph would be close to O(N*log(N)) on real-world word dictionaries.

Also check out the Efficient auto-complete with a ternary search tree article.
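
The answer gives no code, so as a rough illustration only (not taken from the article): a minimal Python sketch of a TST whose near() search yields stored words within a bounded Hamming distance, assuming fixed-length words. TSTNode, insert and near are my names:

class TSTNode:
    __slots__ = ('ch', 'lo', 'eq', 'hi', 'end')
    def __init__(self, ch):
        self.ch, self.end = ch, False
        self.lo = self.eq = self.hi = None

def insert(node, word, i=0):
    # standard TST insertion; returns the (possibly new) subtree root
    if node is None:
        node = TSTNode(word[i])
    if word[i] < node.ch:
        node.lo = insert(node.lo, word, i)
    elif word[i] > node.ch:
        node.hi = insert(node.hi, word, i)
    elif i < len(word) - 1:
        node.eq = insert(node.eq, word, i + 1)
    else:
        node.end = True
    return node

def near(node, word, i=0, d=1, prefix=''):
    # yield stored words of the same length within Hamming distance d of word
    if node is None:
        return
    if d > 0 or word[i] < node.ch:         # lo subtree: position i still unmatched
        yield from near(node.lo, word, i, d, prefix)
    cost = 0 if word[i] == node.ch else 1  # a mismatch spends one unit of budget
    if d - cost >= 0:
        if i == len(word) - 1:
            if node.end:
                yield prefix + node.ch
        else:
            yield from near(node.eq, word, i + 1, d - cost, prefix + node.ch)
    if d > 0 or word[i] > node.ch:         # hi subtree: position i still unmatched
        yield from near(node.hi, word, i, d, prefix)

With the question's words it reproduces the expected graph:

root = None
for w in ['lol', 'lot', 'bot']:
    root = insert(root, w)
print({w: [n for n in near(root, w) if n != w] for w in ['lol', 'lot', 'bot']})
# {'lol': ['lot'], 'lot': ['bot', 'lol'], 'bot': ['lot']}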