Problem with sklearn MultiLabelBinarizer()
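Starting from my dataframe x_train, I want to one-hot encode the 'genres' column. There are over 1000 unique genre values, but when I use MultiLabelBinarizer it reports only 31 columns, and the classes make no sense. The help page suggests passing an array rather than a list, which is what I did in the example, but it still doesn't give me a 36518 x 1388 matrix. What am I missing?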
x_train:
movie_id year synopsis genres
0 30924 2005 Cruel But Necessary is the story of Betty Muns... Drama
1 34841 2012 Yorkshire, 1974, the Maynard family moves into... Drama Horror Thriller
2 23408 2017 When a renowned architecture scholar falls sud... Drama
3 39470 1996 The story dealt with Lord Rama and his retalia... Children Drama
4 7108 2003 A Thai playboy cons a girl into bed and then l... Comedy Drama Horror Thriller
... ... ... ... ...
x_train.shape:
(36518,5)
import numpy as np
gen = np.array(x_train['genres'])
np.unique(gen).shape
(1388,)
from sklearn.preprocessing import MultiLabelBinarizer
multilabel_binarizer = MultiLabelBinarizer()
y=multilabel_binarizer.fit_transform(gen)
y.shape:
(36518, 31)
multilabel_binarizer.classes_:
array([' ', '-', 'A', 'C', 'D', 'F', 'H', 'I', 'M', 'N', 'R', 'S', 'T',
'W', 'X', 'a', 'c', 'd', 'e', 'h', 'i', 'l', 'm', 'n', 'o', 'r',
's', 't', 'u', 'v', 'y'], dtype=object)
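The strange output occurs because the argument to fit_transform() must be an iterable of iterables (see doc). Here each element of gen is a single string, and a string is itself an iterable of characters, so every distinct character is treated as a label. That is why classes_ contains 31 single characters.
The variable gen must be reformatted so that the genres are separated: split each genre string into a list of strings, e.g.: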
'Drama Horror Thriller' => ['Drama', 'Horror', 'Thriller']
This can be done in one line:
gen = [x.split(' ') for x in list(x_train['genres'])]
gen
[['Drama'],
['Drama', 'Horror', 'Thriller'],
['Drama'],
['Children', 'Drama'],
['Comedy', 'Drama', 'Horror', 'Thriller']]
gen now has the correct format for fit_transform():
from sklearn.preprocessing import MultiLabelBinarizer
multilabel_binarizer = MultiLabelBinarizer()
y = multilabel_binarizer.fit_transform(gen)
multilabel_binarizer.classes_
['Children' 'Comedy' 'Drama' 'Horror' 'Thriller']
y
array([[0, 0, 1, 0, 0],
[0, 0, 1, 1, 1],
[0, 0, 1, 0, 0],
[1, 0, 1, 0, 0],
[0, 1, 1, 1, 1]])
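To see both behaviours side by side, here is a minimal sketch on a toy list that stands in for x_train['genres'] (the five values are copied from the sample rows in the question; the names genres, chars_mlb and mlb are made up for illustration). It assumes genres are always separated by whitespace.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Toy stand-in for x_train['genres'], taken from the question's sample rows
genres = [
    "Drama",
    "Drama Horror Thriller",
    "Drama",
    "Children Drama",
    "Comedy Drama Horror Thriller",
]

# Wrong: each sample is a plain string, so it is iterated character by character
chars_mlb = MultiLabelBinarizer()
chars_mlb.fit(genres)
print(chars_mlb.classes_)  # single characters, not genre names

# Right: each sample is a list of genre names.
# split() with no argument splits on any run of whitespace.
gen = [g.split() for g in genres]
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(gen)
print(mlb.classes_)
print(y.shape)  # (number of rows, number of distinct individual genres)
```

Note that on the full data the number of columns should equal the number of distinct individual genres, not 1388: np.unique(gen) in the question counts distinct genre strings, that is, distinct combinations.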