Features with High Cardinality (How to Vectorize Them?)

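I am trying to run a machine learning problem with scikit-learn on a dataset where one of the columns (features) has high cardinality, around 300K unique values. How do I vectorize such a feature? Using DictVectorizer is not a solution, because the machine runs out of memory.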

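I have read in a few posts that I could just assign a number to each of those string values, but that this would lead to misleading results, since a model would treat the arbitrary integer codes as ordered quantities.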

Has anyone dealt with this kind of feature set? If so, how do I vectorize it so that I can pass it on to train a model?

Try FeatureHasher. It "is a low-memory alternative to DictVectorizer and CountVectorizer, intended for large-scale (online) learning and situations where memory is tight, e.g. when running prediction code on embedded devices."
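As a rough illustration, here is a minimal sketch of hashing a single high-cardinality string column with FeatureHasher. The column name `user_id`, the sample values, and the choice of `n_features=2**20` are assumptions made for the example, not details from the question; a larger `n_features` reduces hash collisions at the cost of a wider (but still sparse) matrix.

```python
from sklearn.feature_extraction import FeatureHasher

# Hypothetical values of one high-cardinality categorical column.
values = ["u_000001", "u_000002", "u_000001"]

# With input_type="string", each sample is an iterable of string tokens.
# Prefixing the (assumed) column name keeps tokens from different columns
# distinct if more categorical columns are hashed into the same space.
rows = [[f"user_id={v}"] for v in values]

# No vocabulary is stored: each token is hashed straight to a column index,
# so memory use does not grow with the number of distinct values.
hasher = FeatureHasher(n_features=2**20, input_type="string")

# FeatureHasher is stateless, so transform() can be called without fit().
X = hasher.transform(rows)  # scipy.sparse matrix, one row per sample

print(X.shape)
print(X.nnz)  # number of stored non-zero entries
```

The result is a SciPy sparse matrix, so it can be passed directly to estimators that accept sparse input (for example `SGDClassifier`). Note that the hashing is one-way: unlike DictVectorizer, there is no mapping back from a column index to the original value.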