Confused about one-hot encoding output

Question

I'm confused about why the final output is [ 1., 0., 0., 1., 0., 0., 1., 0., 0.].

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])  
OneHotEncoder(categorical_features='all', dtype=<... 'numpy.float64'>,
       handle_unknown='error', n_values='auto', sparse=True)
>>> enc.n_values_
array([2, 3, 4])
>>> enc.feature_indices_
array([0, 2, 5, 9])
>>> enc.transform([[0, 1, 1]]).toarray()
array([[ 1.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.]])

Answer 1

This is your training data:

  A    B    C    # <== feature names
  0    0    3
  1    1    0
  0    2    1
  1    0    2

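Now, as you can see in enc.n_values_: array([2, 3, 4])

The first feature, A, has two possible values: 0 and 1. Similarly, feature B has three possible values (0, 1, 2), and feature C has four (0, 1, 2, 3).

In the output, each feature is allotted as many columns as it has possible values, in the same order as the features. Like this: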

A_0   A_1   B_0   B_1   B_2   C_0   C_1   C_2   C_3

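Here A_0 means that feature A has the value 0 in that sample. In that case A_0 will be 1 (hot) and A_1 will be 0. If A has the value 1 instead, A_1 will be 1 (hot) and A_0 will be 0.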
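As a rough illustration of the column layout described above, the sketch below rebuilds the column names from the per-feature value counts. The names `feature_names`, `columns`, and `offsets` are made up for this example (the encoder itself does not produce labels like A_0); only the counts [2, 3, 4] come from the question.

```python
# Illustrative sketch in plain Python, not scikit-learn's own implementation.
n_values = [2, 3, 4]               # from enc.n_values_ in the question
feature_names = ["A", "B", "C"]    # hypothetical labels used in this answer

# One column per (feature, possible value) pair, in feature order.
columns = [
    f"{name}_{value}"
    for name, count in zip(feature_names, n_values)
    for value in range(count)
]

# Running totals of the counts give the index where each feature's block starts.
offsets = [0]
for count in n_values:
    offsets.append(offsets[-1] + count)

print(columns)
print(offsets)
```

The running totals come out as [0, 2, 5, 9], the same numbers shown for enc.feature_indices_ in the question: A occupies columns 0-1, B columns 2-4, and C columns 5-8.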

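So for the input [0, 1, 1], that is A=0, B=1, C=1:

Here A=0, so A_0 is 1 and the remaining column, A_1, is 0. B=1, so B_1 is 1 and the others (B_0 and B_2) are 0. The same goes for C: C=1, so C_1 is 1 and C_0, C_2, and C_3 are 0.

So the final output is: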

A_0   A_1   B_0   B_1   B_2   C_0   C_1   C_2   C_3
 1.,    0.,  0.,   1.,   0.,   0.,   1.,   0.,   0.
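To check this by hand, here is a minimal sketch that encodes one row the same way, one block of columns per feature. The helper `one_hot_row` is a hypothetical name written for this answer, not part of scikit-learn, and it assumes each feature's values are the integers 0 to n-1, as in the training data above.

```python
# Hand-rolled sketch of the encoding; assumes values are integers 0..n-1.
def one_hot_row(row, n_values):
    """Encode one sample as a flat list with one 'hot' column per feature."""
    encoded = []
    for value, count in zip(row, n_values):
        block = [0.0] * count   # one column for each possible value
        block[value] = 1.0      # mark the column matching the actual value
        encoded.extend(block)
    return encoded

result = one_hot_row([0, 1, 1], [2, 3, 4])
print(result)  # [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
```

This matches the array returned by enc.transform([[0, 1, 1]]).toarray() in the question.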

See this for more information: http://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features

python

scikit-learn

one-hot-encoding