Correct way of one-hot-encoding class labels for multi-class problem

Question

I have a classification problem with multiple classes; let's call them A, B, C and D. My data has the following shape:

X=[#samples, #features, 1], y=[#samples,1].

More specifically, y looks like this:

[['A'], ['B'], ['D'], ['A'], ['C'], ...]

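When I train a random forest classifier on these labels, it works fine, but I have read several times that class labels also need to be one-hot encoded. After one-hot encoding, y is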

[[1,0,0,0], [0,1,0,0], ...]

with shape

[#samples, 4]

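The problem appears when I try to use this as input to the classifier. The model predicts each of the four labels separately, which means it can also produce outputs like [0 0 0 0], which I don't want. rfc.classes_ returns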

# [array([0, 1]), array([0, 1]), array([0, 1]), array([0, 1])]

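How do I tell the model that the labels are one-hot encoded, rather than multiple labels that should be predicted independently of each other? Do I need to change my y, or do I need to change some setting of the model?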

Answer 1

You don't have to one-hot encode the labels when using a random forest in sklearn.

What you need is a "label encoder", and your y should look like this:

from sklearn.preprocessing import LabelEncoder
y = ["A","B","D","A","C"]
le = LabelEncoder()
le.fit_transform(y)
# array([0, 1, 3, 0, 2], dtype=int64)
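As a small sketch of how the integer codes relate to the original labels (the variable names y_encoded and y_decoded are illustrative, not from the answer), the fitted encoder keeps the sorted class names and can map codes back:

```python
from sklearn.preprocessing import LabelEncoder

y = ["A", "B", "D", "A", "C"]
le = LabelEncoder()
y_encoded = le.fit_transform(y)

# classes_ holds the sorted unique labels; position i corresponds to code i
print(le.classes_)  # ['A' 'B' 'C' 'D']

# inverse_transform maps the integer codes back to the original labels
y_decoded = le.inverse_transform(y_encoded)
print(list(y_decoded))
```

This is handy if you train on the integer codes and want predictions reported as the original class names.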

I tried modifying the example code that sklearn provides:

from sklearn.ensemble import RandomForestClassifier
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)
# Replace the generated labels with random string labels
y = np.random.choice(["A", "B", "C", "D"], 1000)
print(y.shape)
clf = RandomForestClassifier(max_depth=2, random_state=0)
clf.fit(X, y)
clf.classes_
# array(['A', 'B', 'C', 'D'], dtype='<U1')

It works with RandomForestClassifier whether or not y is label-encoded.
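One way to check this claim is to fit the same forest twice, once on string labels and once on label-encoded integers, and compare the predictions. The following is a sketch under assumptions: the seeded generator rng and the names y_str, y_int, clf_str and clf_int are illustrative and not part of the original answer.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

X, _ = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)

# Random string labels, seeded so the run is reproducible
rng = np.random.default_rng(0)
y_str = rng.choice(["A", "B", "C", "D"], size=1000)

# Integer-encoded version of the same labels
le = LabelEncoder()
y_int = le.fit_transform(y_str)

clf_str = RandomForestClassifier(max_depth=2, random_state=0).fit(X, y_str)
clf_int = RandomForestClassifier(max_depth=2, random_state=0).fit(X, y_int)

# Decode the integer predictions and compare with the string predictions
pred_str = clf_str.predict(X)
pred_int = le.inverse_transform(clf_int.predict(X))
print(np.array_equal(pred_str, pred_int))
```

With the same random_state, both fits should see the same internal class indices, so the comparison is expected to report agreement.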

Answer 2

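Your original approach, without one-hot encoding, already does what you want.

One-hot encoding is appropriate for the inputs of many models, but for the outputs of only a few (for example, neural networks trained with a cross-entropy loss). So only some algorithm implementations need it, while others do fine without it.

For output labels, classifiers like RandomForest work fine with strings and with multiple classes.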
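To make the difference concrete, here is a minimal sketch on synthetic data (the data set and all variable names are assumptions for illustration, not from the question). A 2-D one-hot y is treated as four independent binary outputs, while a 1-D label array, recovered here with argmax, gives a single multi-class output:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

classes = np.array(["A", "B", "C", "D"])
X, y_idx = make_classification(n_samples=200, n_features=6,
                               n_informative=4, n_redundant=0,
                               n_classes=4, n_clusters_per_class=1,
                               random_state=0)
labels = classes[y_idx]  # 1-D array of strings, shape (200,)

# One-hot version of the same labels, shape (200, 4)
one_hot = (labels[:, None] == classes[None, :]).astype(int)

# Fitting on the 2-D target yields a multi-output model:
# one binary problem per column
clf_oh = RandomForestClassifier(random_state=0).fit(X, one_hot)
print(len(clf_oh.classes_))     # 4
print(clf_oh.predict(X).shape)  # (200, 4)

# Collapse the one-hot rows back to a single label per sample
recovered = classes[one_hot.argmax(axis=1)]

# Fitting on the 1-D target yields one multi-class model
clf = RandomForestClassifier(random_state=0).fit(X, recovered)
print(clf.classes_)             # ['A' 'B' 'C' 'D']
print(clf.predict(X).shape)     # (200,)
```

With the 1-D target every prediction is exactly one of the four classes, so an all-zero output like [0 0 0 0] cannot occur.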

Answer 3

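The labels here don't need to be encoded; whether they are strings or numbers, they can stay as they are. As far as I know, one-hot encoding or label encoding of the labels is something to consider when you use a neural network. An example with the BBC classification data: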

model.predict(sample_data)

array(['entertainment'], dtype='

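Text (categorical) features in the training set, on the other hand, must be one-hot encoded. For example:

    name         fuel_type
    baleno       petrol
    MG hector    electric

After one-hot encoding:

    name         fuel_type_petrol    fuel_type_electric
    baleno       1                   0
    MG hector    0                   1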
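A minimal sketch of encoding such a feature column, assuming sklearn's OneHotEncoder is used (the answer does not say which tool produced the table, and the variable names are illustrative):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One categorical feature column, one row per car
fuel_type = np.array([["petrol"], ["electric"]])

enc = OneHotEncoder()
# fit_transform returns a sparse matrix by default; convert it for display
encoded = enc.fit_transform(fuel_type).toarray()

# Categories are sorted, so the columns are electric, then petrol
print(enc.categories_[0])  # ['electric' 'petrol']
print(encoded)
```

Note that the column order follows the sorted category names, so it differs from the order shown in the table above.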

