Encoding with OneHotEncoder
I'm trying to preprocess data with scikit-learn's OneHotEncoder. Obviously, I'm doing something wrong. Here is my sample program:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
cat = ['ok', 'ko', 'maybe', 'maybe']
label_encoder = LabelEncoder()
label_encoder.fit(cat)
cat = label_encoder.transform(cat)
# returns [2 0 1 1], which seems good.
print(cat)
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
res = ct.fit_transform([cat])
print(res)
Actual result: [[1.0 0 1 1]]
Expected result: something like:
[
[ 1 0 0 ]
[ 0 0 1 ]
[ 0 1 0 ]
[ 0 1 0 ]
]
Can someone point out what I'm missing?
Consider using numpy and MultiLabelBinarizer.
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
cat = np.array([['ok', 'ko', 'maybe', 'maybe']])
m = MultiLabelBinarizer()
print(m.fit_transform(cat.T))
If you still want to stick with your solution, you only need to update it as follows:
# [cat] is still a single row (shape (1, 4)), not a column (shape (4, 1))
# res = ct.fit_transform([cat]) => remove this
# this should work
res = ct.fit_transform(np.array([cat]).T)
Out[2]:
array([[0., 0., 1.],
[1., 0., 0.],
[0., 1., 0.],
[0., 1., 0.]])
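As a side note, the LabelEncoder step may not be needed at all, since OneHotEncoder accepts string categories directly. Below is a minimal sketch under that assumption; the variable names X and encoder are mine, not from the question. The key point is the same as above: scikit-learn expects one row per sample and one column per feature, so the list has to be reshaped into a column.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

cat = ['ok', 'ko', 'maybe', 'maybe']

# One row per sample, one column per feature: shape (4, 1), not (1, 4).
X = np.array(cat).reshape(-1, 1)

encoder = OneHotEncoder()
# fit_transform returns a SciPy sparse matrix by default; toarray() makes it dense.
res = encoder.fit_transform(X).toarray()

# Categories are sorted, so the output columns correspond to ko, maybe, ok.
print(encoder.categories_)
print(res)
```

The columns follow the sorted category order stored in encoder.categories_, so this should give the same matrix as Out[2] above.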