sklearn FeatureHasher output has lots of collisions and unused columns

Question

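I am using sklearn.feature_extraction.FeatureHasher to encode categorical variables for machine learning. My categorical variables are numeric IDs, e.g. 1, 2, 3, 6, 18, 19, 20, ... In total I have 18,000 unique IDs, and the highest ID is 28,000.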

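I want to hash them to encode them as categories. One-hot encoding is out of the picture: it would create 18,000 columns, and my dataset already has 4,000,000 rows, so that would be painful.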
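For a sense of scale, here is a back-of-the-envelope sketch of what a dense one-hot matrix would cost. The one-byte-per-cell and float64 assumptions are illustrative choices, not measurements of the real dataset.

```python
# Rough size estimate for a DENSE one-hot encoding (assumption: 1 byte per cell).
n_rows = 4_000_000   # rows in the dataset, as stated above
n_ids = 18_000       # unique IDs, i.e. one-hot columns

dense_onehot_bytes = n_rows * n_ids * 1
print(f"dense one-hot: {dense_onehot_bytes / 1e9:.1f} GB")

# For comparison: 50 hashed columns stored densely as float64 (8 bytes per cell).
hashed_bytes = n_rows * 50 * 8
print(f"50 hashed columns: {hashed_bytes / 1e9:.1f} GB")
```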

I can't show my dataframe, so I illustrate the problem on a sample dataframe.

from sklearn.feature_extraction import FeatureHasher
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

tmpdf = pd.DataFrame(np.arange(10000), columns=["test"])

This produces:

      test
0        0
1        1
2        2
3        3
4        4
...    ...
9995  9995
9996  9996
9997  9997
9998  9998
9999  9999

[10000 rows x 1 columns]

Now, if I hash using dictionaries:

H = FeatureHasher(n_features=50, input_type="dict")

cdict = [{str(i) : 1} for i in range(10000)]
arr = H.transform(cdict)

plt.matshow(arr.toarray()[::100], cmap="tab10", vmin=-5, vmax=4)
plt.colorbar(shrink=0.4)
plt.show()

This produces:

If I hash using strings:

H = FeatureHasher(n_features=50, input_type="string")

arr = H.transform(tmpdf["test"].astype(str))

plt.matshow(arr.toarray()[::100], cmap="tab10", vmin=-5, vmax=4)
plt.colorbar(shrink=0.4)
plt.show()

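I get:

Question:

Why does the output of the string-based hashing look so odd? It looks like a lot of collisions are happening, since many columns remain completely unused... Am I using it wrong? Is there a better way to accomplish my task of encoding user IDs?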

Answer 1

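As Ben Reiniger commented, the problem is that passing a list of strings makes the hasher iterate over each string. The string "12345" is not hashed as-is; instead, its characters "1", "2", "3", "4", "5" are hashed individually. Since I only have numbers, I end up with just 10 unique tokens (the digits "0" to "9"), which causes a lot of collisions.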
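The effect can be checked by counting how many output columns are ever touched. The sketch below spells out the per-character behaviour explicitly with list(str(i)), because, as far as I know, recent scikit-learn releases reject bare strings with an error instead of iterating over them. The variable names are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction import FeatureHasher

hasher = FeatureHasher(n_features=50, input_type="string")

# Each sample must be an iterable of tokens.
# list("12345") -> ["1", "2", "3", "4", "5"]: one token per character,
# which is what iterating over a bare string amounts to.
per_char = hasher.transform([list(str(i)) for i in range(10000)])

# ["12345"] -> a single token: the whole ID is hashed at once.
whole = hasher.transform([[str(i)] for i in range(10000)])

cols_per_char = np.unique(per_char.nonzero()[1])
cols_whole = np.unique(whole.nonzero()[1])

print("columns used, per character:", len(cols_per_char))
print("columns used, whole string: ", len(cols_whole))
```

With per-character tokens there are only ten distinct inputs, so at most ten of the 50 columns can ever be nonzero.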

A possible solution is to wrap each string in a list, so that the hasher iterates over the list rather than over the string itself and hashes the whole string.

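# Convert the column to strings, then reshape to (n_samples, 1)
# so that every sample is a one-element sequence holding the full ID.
obj = tmpdf["test"].astype(str).to_numpy()
obj = obj.reshape(*obj.shape, 1)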

This results in:

array([['0'],
       ['1'],
       ['2'],
       ...,
       ['9997'],
       ['9998'],
       ['9999']], dtype=object)

When hashing:

H = FeatureHasher(n_features=50, input_type="string")

arr = H.transform(obj)

plt.matshow(arr.toarray()[::100], cmap="tab10", vmin=-5, vmax=4)
plt.colorbar(shrink=0.4)
plt.show()

The resulting output is much more to my liking.

Thanks!
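To put numbers on the plot, one could count how many columns are used and how many IDs share the busiest column. This is only a rough sketch on the 10,000 sample IDs; the helper name column_usage and the n_features values tried are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.feature_extraction import FeatureHasher


def column_usage(n_features, ids):
    """Return (columns used, number of IDs in the busiest column)."""
    hasher = FeatureHasher(n_features=n_features, input_type="string")
    # One single-token list per sample, so each ID is hashed as a whole.
    hashed = hasher.transform([[s] for s in ids])
    # Every row holds exactly one nonzero entry (+1 or -1), so the column
    # indices tell us which bucket each ID landed in.
    cols = hashed.nonzero()[1]
    counts = np.bincount(cols, minlength=n_features)
    return int((counts > 0).sum()), int(counts.max())


ids = [str(i) for i in range(10000)]
for n_features in (50, 2**10, 2**15):
    used, busiest = column_usage(n_features, ids)
    print(f"n_features={n_features}: {used} columns used, "
          f"busiest column holds {busiest} IDs")
```

With only 50 columns, at least 200 of the 10,000 IDs must share some column, so collisions are unavoidable at that size no matter how the input is wrapped. A larger n_features trades more columns for fewer collisions.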
