Given a dictionary of word and frequency pairs, how do I proceed with text mining in scikit-learn?
I already have word frequencies and categories like this:
y = ['animals', 'restaurants', 'sports']
x = [{'cat':1, 'dog':2}, {'food':4, 'drink':2}, {'baseball':4, 'basketball':5}]
How should I build a pipeline like the one in the tutorial, shown below?
>>> from sklearn.pipeline import Pipeline
>>> text_clf = Pipeline([('vect', CountVectorizer()),
...                      ('tfidf', TfidfTransformer()),
...                      ('clf', MultinomialNB()),
... ])
>>> text_clf = text_clf.fit(twenty_train.data, twenty_train.target)
CountVectorizer expects strings... I suppose I could build a string from each dictionary by repeating every word as many times as it occurs?
http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
If you already have the word frequencies, use DictVectorizer:
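from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# DictVectorizer turns each {word: count} dict into one row of a sparse count matrix.
pipeline = Pipeline([('dvect', DictVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])
model = pipeline.fit(x, y)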
Then you can do:
>>> model.predict([{'cat':1}])[0]
'animals'
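To see why this works without building strings, it may help to look at what DictVectorizer produces on its own. The following is a minimal sketch using the x from the question; the variable names dvect, counts and unseen are only illustrative.

```python
from sklearn.feature_extraction import DictVectorizer

x = [{'cat': 1, 'dog': 2}, {'food': 4, 'drink': 2}, {'baseball': 4, 'basketball': 5}]

dvect = DictVectorizer()
counts = dvect.fit_transform(x)  # sparse matrix, one row per input dict

# Columns correspond to the dictionary keys, sorted alphabetically by default.
print(dvect.feature_names_)

# Each row holds the frequencies that were already in the dict.
print(counts.toarray())

# Keys that were not seen during fit are silently ignored at transform time.
unseen = dvect.transform([{'cat': 1, 'unicorn': 3}])
print(unseen.toarray())
```

The resulting matrix has the same shape of content that CountVectorizer would have produced from raw text (documents as rows, words as columns), which is why the rest of the pipeline can stay unchanged.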