了解 FeatureHasher、碰撞和矢量大小权衡

Question

我在实施机器学习模型之前预处理我的数据。一些功能具有高基数，例如国家和语言。

由于将这些特征编码为单热向量会产生稀疏数据，我决定研究 the hashing trick 并像这样使用 python 的 category_encoders：

from category_encoders.hashing import HashingEncoder
ce_hash = HashingEncoder(cols = ['country'])
encoded = ce_hash.fit_transform(df.country)
encoded['country'] = df.country
encoded.head()

查看结果时，我可以看到碰撞

    col_0  col_1  col_2  col_3  col_4  col_5  col_6  col_7 country
0       0      0      1      0      0      0      0      0      US <━┓
1       0      1      0      0      0      0      0      0      CA.  ┃ US and SE collides 
2       0      0      1      0      0      0      0      0      SE <━┛
3       0      0      0      0      0      0      1      0      JP

进一步的调查让我找到 this Kaggle article。那里的散列示例包括 X 和 y。

y的作用是什么，对解决碰撞问题有帮助吗？
我是否应该向编码器添加更多列并将多个特征一起编码（例如国家和语言）？

将感谢有关如何使用散列技巧对此类类别进行编码的解释。

更新： 根据我从@CoMartel 那里得到的评论，我查看了 Sklearn FeatureHasher 并编写了以下代码来散列国家/地区列：

from sklearn.feature_extraction import FeatureHasher
h = FeatureHasher(n_features=10,input_type='string')
f = h.transform(df.country)
df1 = pd.DataFrame(f.toarray())
df1['country'] = df.country
df1.head()

并得到以下输出：

     0    1    2    3    4    5    6    7    8    9 country
0 -1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0 -1.0  0.0      US
1 -1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0 -1.0  0.0      US
2 -1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0 -1.0  0.0      US
3  0.0 -1.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0      CA
4  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0 -1.0  0.0      SE
5  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0      JP
6 -1.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0      AU
7 -1.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0      AU
8 -1.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0      DK
9  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0 -1.0  0.0      SE

这是使用库的方式来编码高分类价值观？
为什么有些值是负数？
你会如何选择“正确的”n_features值？
如何查看碰撞率？

Answer 1

Is that the way to use the library in order to encode high categorical values?

是的。您的实施没有任何问题。

您可以将散列技巧视为一种“减少碰撞风险的单热编码，如果您可以容忍原始特征维度，则无需使用它".

这个想法最初是由 Kilian Weinberger 提出的。你可以在他们的论文中找到算法的整个理论分析和 practically/empirically.

Why are some values negative?

为避免冲突，使用了 signed 哈希函数。也就是说，首先使用通常的 hash function 对字符串进行散列处理（例如，通过对每个字符的 ASCII 值求和，然后对 n_feature 取模，将字符串转换为相应的数值，然后取模 n_feature 以在 (0 ,n_features]).然后又一个单位输出散列函数.后者产生+1或-1 根据定义，它被添加到第一个哈希函数产生的索引中。

伪代码（虽然看起来像 Python）：

def hash_trick(features, n_features):
     for f in features:
         res = np.zero_like(features)
         h = usual_hash_function(f) # just the usual hashing
         index = h % n_features  # find the modulo to get index to place f in res
         if single_bit_hash_function(f) == 1:  # to reduce collision
             res[index] += 1
         else:
             res[index] -= 1 # <--- this will make values to become negative

     return res

How would you choose the "right" n_features value?

根据经验，您可以猜到，如果我们散列一个额外的特征（即#n_feature + 1），冲突肯定会发生。因此，最好的情况是每个特征都映射到一个唯一的哈希值——希望如此。在这种情况下，从逻辑上讲，n_features应该至少等于features/categories的实际数量（在你的具体情况，不同国家的数量）。尽管如此，请记住这是“最佳”情况，“从数学上讲”并非如此。所以，当然是越高越好，但是多高呢？看下一个。

How can I check the collisions ratio?

如果我们忽略第二个单比特散列函数，问题就简化为“散列的生日问题”。

这是一个很大的话题。对于这个问题的全面介绍，我建议你阅读this, and for some detailed math, I recommend this答案。

简而言之，您需要知道的是，不发生碰撞的概率是 exp(-1/2) = 60.65%，这意味着至少有大约 39.35% 发生一次碰撞的可能性。

因此，根据经验，如果我们有 X 个国家/地区，则如果散列函数输出范围（即 n_feature参数）是X^2。换句话说，如果您示例中的国家/地区数量 = square_root(n_features)，则有 40% 发生碰撞的可能性。当你以指数方式增加 n_features 时，碰撞的机会减少了一半。（个人认为，如果不是为了安全起见，只是单纯的字符串到数字的转换，不值得去太高）。

好奇读者的旁注：对于足够大的哈希函数输出大小（例如 256 位），攻击者猜测（或利用）冲突的机会几乎是不可能的（从安全角度来看）。

关于 y 参数，正如您已经在评论中看到的那样，它只是出于兼容性目的，未被使用（scikit-learn 在许多其他实现中都有这个）。

了解 FeatureHasher、碰撞和矢量大小权衡

Understanding FeatureHasher, collisions and vector size trade-off

python

machine-learning

data-science