Missing categorical data should be encoded with an all-zero one-hot vector

Question

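I'm working on a machine learning project with very sparsely labeled data. There are several categorical features, which together have roughly one hundred distinct classes.

For example: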

0    red
1    blue
2    <missing>

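import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
color_cat = pd.DataFrame(['red', 'blue', np.nan])  # np.NAN was removed in NumPy 2.0
color_enc = OneHotEncoder(handle_unknown='ignore')  # sparse output is the default
color_one_hot = color_enc.fit_transform(color_cat)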

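After running these through scikit-learn's OneHotEncoder, I expected the missing value to be encoded as 00, because the documentation states that handle_unknown='ignore' makes the encoder return an all-zero array. Replacing the missing value with another value using [SimpleImputer][1] is not an option for me.

What I expected: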

0    10
1    01
2    00

Instead, OneHotEncoder treats the missing value as just another category.

What I get:

0    100
1    010
2    001

I have seen the related question, but its solution does not work for me. I explicitly need a zero vector.

Answer 1

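I have never really worked with sparse matrices, but one approach is to drop the column that corresponds to your NaN value. Take categories_ from your fitted encoder, build a boolean mask that is True wherever the category is not NaN (I used pd.Series.notna, but there may be other ways), and use it to create a new sparse matrix (or reassign the existing one). Basically, add this to your code: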

# currently you have
color_one_hot
# <3x3 sparse matrix of type '<class 'numpy.float64'>'
#   with 3 stored elements in Compressed Sparse Row format>

# line of code to add
new_color_one_hot = color_one_hot[:,pd.Series(color_enc.categories_[0]).notna().to_numpy()]

# and now you have
new_color_one_hot
# <3x2 sparse matrix of type '<class 'numpy.float64'>'
#   with 2 stored elements in Compressed Sparse Row format>

# and
new_color_one_hot.todense()
# matrix([[0., 1.],
#         [1., 0.],
#         [0., 0.]])
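
Since the question mentions several categorical features, the same masking idea can be extended to more than one column. The following is only a sketch: the helper name drop_missing_columns and the second feature 'size' are made up for illustration, and it assumes the encoder was created without the drop parameter, so that the output columns follow the order of categories_.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder


def drop_missing_columns(encoder, encoded):
    """Return `encoded` without the columns that stand for a missing category.

    Hypothetical helper, not part of scikit-learn. Assumes `drop` was not set,
    so output columns are the per-feature categories concatenated in order.
    """
    # categories_ holds one array per input feature.
    keep = np.concatenate([~pd.isna(cats) for cats in encoder.categories_])
    return encoded[:, keep]


# Explicit object dtype so each column holds plain strings plus a float NaN.
df = pd.DataFrame(
    {'color': ['red', 'blue', np.nan], 'size': ['S', np.nan, 'L']},
    dtype=object,
)

enc = OneHotEncoder(handle_unknown='ignore')
encoded = enc.fit_transform(df)               # sparse, one NaN column per feature
trimmed = drop_missing_columns(enc, encoded)  # NaN columns removed
dense = trimmed.toarray()
```

Rows with a missing value then carry zeros in the block of columns belonging to that feature, while the other features in the same row are encoded as usual.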

Edit: get_dummies gives a similar result: pd.get_dummies(color_cat[0], sparse=True)
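
A small sketch of the get_dummies route, in case it helps. It relies on the default dummy_na=False, which leaves the NaN row without any indicator set. The Series name 'color' and the dtype=float argument are my own choices for illustration (sparse=True is left out here only to keep the output easy to inspect).

```python
import numpy as np
import pandas as pd

colors = pd.Series(['red', 'blue', np.nan], name='color')

# dummy_na defaults to False, so NaN gets no column of its own
# and its row stays all zeros.
dummies = pd.get_dummies(colors, dtype=float)

columns = dummies.columns.tolist()
values = dummies.to_numpy().tolist()
```

One thing to keep in mind: get_dummies has no fitted state, so the resulting columns depend on whatever data is passed in. If training and test data are encoded separately, the columns may need to be aligned afterwards.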

Edit: on closer inspection, you can specify the categories parameter in OneHotEncoder, so if you do:

color_cat = pd.DataFrame(['red', 'blue', np.nan])
color_enc = OneHotEncoder(categories=[color_cat[0].dropna().unique()],  ## here
                          handle_unknown='ignore')  # sparse output is the default
color_one_hot = color_enc.fit_transform(color_cat)
color_one_hot.todense()
# matrix([[1., 0.],
#         [0., 1.],
#         [0., 0.]])
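
Why this works, as far as I understand it: with an explicit categories list that leaves out NaN, the missing value is just another unknown category, and handle_unknown='ignore' encodes unknown categories as all zeros. Here is a self-contained sketch that also transforms new data. The frame names train and new and the value 'green' are invented for the example. It relies on the default sparse output instead of passing the sparse argument, which was renamed to sparse_output in scikit-learn 1.2.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Explicit object dtype so the column holds plain strings plus a float NaN.
train = pd.DataFrame({'color': ['red', 'blue', np.nan]}, dtype=object)

# Known categories only: NaN is deliberately left out of the list.
known = [train['color'].dropna().unique()]

enc = OneHotEncoder(categories=known, handle_unknown='ignore')
enc.fit(train)

# 'green' was never seen and NaN is not a listed category,
# so both are treated as unknown at transform time.
new = pd.DataFrame({'color': ['blue', 'green', np.nan]}, dtype=object)
encoded = enc.transform(new).toarray()
```

The column order follows the list passed to categories (here red, then blue), not alphabetical order, which is why the output differs from the masking approach above.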


python

machine-learning

pandas

scikit-learn

data-science