散列 python 中的不同元组给出相同的结果

Question

我正在处理整数矩阵集，我认为将它们表示为元组很有意义，因为它们是可散列的。然而 hash() 函数给了我奇怪的元组结果：

hash(((1, -1, 0), (1, 0, 0), (1, 0, -1)))

Out[147]: -697649482279922733

hash(((1, 0, -1), (1, 0, 0), (1, -1, 0)))

Out[148]: -697649482279922733

如您所见，这两个不同的元组具有相同的哈希值。请注意，它们实际上非常相似（交换第一个和最后一个子元组），但是我找不到更简单的示例：例如 ((0,1),(0,0)) 和 ((0,0),(0,1)) 具有不同的哈希值。

有什么线索吗？我简直不敢相信这真是运气太差了...现在我已经找到了问题所在，我可以轻松绕过它，但我认为无论如何都值得一提。

Answer 1

直到Python 3.8，元组的散列是基于使用以下公式（来自tuplehash() function）的内容的散列：

Py_uhash_t mult = _PyHASH_MULTIPLIER; /* defined as 1000003UL == 0xf4243 */
x = 0x345678UL;
p = v->ob_item;
while (--len >= 0) {
    y = PyObject_Hash(*p++);
    if (y == -1)
        return -1;
    x = (x ^ y) * mult;
    /* the cast might truncate len; that doesn't change hash stability */
    mult += (Py_hash_t)(82520UL + len + len);
}
x += 97531UL;
if (x == (Py_uhash_t)-1)
    x = -2;
return x;

这是一种称为 FNV-1 (Fowler / Noll / Vo) hash method 的方法。

碰巧，该公式为 (1, 0, -1) 和 (1, -1, 0):

产生完全相同的输出

>>> hash((1, -1, 0))
-2528505496374624146
>>> hash((1, 0, -1))
-2528505496374624146

因为包含的 3 个整数的哈希值是 1、0 和 -2:

>>> hash(1)
1
>>> hash(0)
0
>>> hash(-1)
-2

并且交换 0 和 -2 对结果没有实际影响。

因此，包含的 3 个元组的哈希值在两个示例之间没有变化，因此最终的哈希值也没有变化。

这只是一个巧合，我希望在实践中这种情况不会经常发生，而且字典和集合已经可以很好地处理冲突。

然而，在写下我最初的答案几年后，事实证明我的期望是错误的！上面的 tuplehash() 实施已经实施了 14 年，直到 someone insisted that there was a problem with the scheme。事实证明，某些值组合（例如4和-4，或0.25和0.5）大大减少了可能的散列该方法可以输出的值：

>>> import sys; from itertools import product
>>> sys.version_info
sys.version_info(major=3, minor=7, micro=7, releaselevel='final', serial=0)
>>> values = (0.25, 0.5)
>>> sum(1 for _ in product(values, repeat=20))  # 20 elements in each tuple
1048576
>>> len(set(map(hash, product(values, repeat=20))))
32

以上创建了所有 1048576 (2 ** 20 == 1024 ** 2) 个可能的 20 值元组，这些元组组合了 0.25 和 0.5。理想情况下，它们都应该有不同的哈希值，或者至少有非常多的不同哈希值。但是上面的 tuplehash() 函数只产生了 32 个唯一值。这 32 个唯一哈希中的每一个都适用于 32768 (2 ** 15) 个这样的组合：

>>> from collections import Counter
>>> Counter(Counter(map(hash, product(values, repeat=20))).values())
Counter({32768: 32})

这其实是一个大的问题！上述问题也适用于 1, -1, 0，只是没有那么明显；在这里测试 3 ** 12 == 531441 组合：

>>> values = (1, -1, 0)
>>> sum(1 for _ in product(values, repeat=12))
531441
>>> len(set(map(hash, product(values, repeat=12))))
238605
>>> Counter(Counter(map(hash, product(values, repeat=12))).values())
Counter({1: 153005, 2: 51006, 4: 21730, 8: 8424, 16: 3012, 32: 994, 64: 314, 128: 92, 256: 20, 512: 6, 1024: 2})

所以为这些 12 元素元组生成的哈希中有 153005 个都使用单个哈希。

因此在 Python 3.8 中，implementation was switched from FNV-1a to an adaptation of the xxHash fast digest scheme. See the new tuplehash() function implementation 了解详情。

这个新方法对您问题中的示例表现出色：

>>> sys.version_info
sys.version_info(major=3, minor=8, micro=1, releaselevel='final', serial=0)
>>> hash((1, -1, 0))
426056430309831993
>>> hash((1, 0, -1))
-7823806182320511195
>>> hash(((1, -1, 0), (1, 0, 0), (1, 0, -1)))
-6252168277346219339
>>> hash(((1, 0, -1), (1, 0, 0), (1, -1, 0)))
-5221381175350594014

还有我上面讨论的病理情况：

>>> values = (0.25, 0.5)
>>> len(set(map(hash, product(values, repeat=20))))
1048576
>>> values = (1, -1, 0)
>>> len(set(map(hash, product(values, repeat=12))))
531441

Answer 2

看起来很奇怪，但不要使用任何一种方式 hash：https://docs.python.org/2/library/functions.html#hash

[hash is] used to quickly compare dictionary keys during a dictionary lookup.

它实际上并不是为通用散列而设计的——字典除了简单的散列相等性之外还有额外的检查。对于通用哈希，使用 hashlib

散列 python 中的不同元组给出相同的结果

hashing different tuples in python give identical result

python

hash

tuples