字符串搜索算法——字符串匹配的复杂度

Question

我正在尝试解决 'String search algorithm' 但许多网站的答案似乎很复杂（ 'Naive string search' 和 O(m(n-m+1) )，下面我的算法有什么问题，它的最坏情况复杂度为 O(n)，而 KMP 也有 O(n) 因此我肯定是错的，但是在哪里呢？

def find(s1, s2):
    size = len(s1)
    index = 0 
    while ( index != len(s2)):
        if s2[index : index + size] == s1:
            print 'Pattern found at index %s'%(index)
            index += size
        else:
            index += 1

好的，所以我假设 s2[index : index + size] == s1 是 O(1)，也就是 O(n)，所以现在我原来的问题变成了，

为什么不计算和比较两个字符串的哈希值，如果两个哈希值相等，则字符串应该相等。
我不明白他们怎么会相撞。这不依赖于哈希算法。就像 MD5 已知的中断一样。

Answer 1

原问题

我认为您的代码的复杂度不是 O(n)，而是 O(mn)。此检查：s2[index : index + size] == s1，因为在最坏的情况下，它需要对字符进行 len(s1) 比较。

散列

这里是Wikipedia's definition of a hash function：

A hash function is any function that can be used to map data of arbitrary size to data of fixed size. The values returned by a hash function are called hash values, hash codes, digests, or simply hashes. One use is a data structure called a hash table, widely used in computer software for rapid data lookup.

这里我们运行进入这个方法的第一个问题。哈希函数接收任意大小的值，returns 接收固定大小的值。尽可能遵循pigeonhole principle, there is at least one hash with multiple values, probably more. As a quick example, imagine your hash function always produces an output that is one byte long. That means there are 256 possible outputs. After you've hashed 257 items, you'll always be certain there are at least 2 items with the same hash. To avoid this for as long as possible, a good hash function will map inputs over all possible outputs as uniformly。

因此，如果哈希值不相等，您可以确定字符串不相等，但反之则不然。 两个不同的字符串可以具有相同的哈希值。

字符串搜索算法——字符串匹配的复杂度

String searching algorithm - Complexity of string matching

python

string

algorithm

md5

原问题

散列