Input length mismatch scikit
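I am trying to run some analysis with DecisionTreeClassifier, but I get the following error:
ValueError: Number of features of the model must match the input. Model n_features is 1 and input n_features is 4
I used the same training and test sets with the SVC and GaussianNB classifiers, and both ran fine. My code is below. I know the test and training sets have the same structure: before vectorization, each is a list of strings. I can't see where the mismatch is.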
#classify with just scikit
from sklearn.feature_extraction.text import TfidfVectorizer
from tools.striper import stripe, cleanupfiles
from tools.tweetprocessor import clean, wordclean
from sklearn import svm
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.metrics import classification_report
from sklearn import tree
stripe(0.1)
training = []
traininglabel = []
test = []
testlabel = []
with open('tempdata/goodtraining.txt','r') as f:
    for line in f:
        tweet = [wordclean(x) for x in clean(line.rstrip('\n')).split()]
        tweet = [x for x in tweet if len(x) >= 3]
        training.append(' '.join(tweet))
        traininglabel.append('good')
with open('tempdata/badtraining.txt','r') as f:
    for line in f:
        tweet = [wordclean(x) for x in clean(line.rstrip('\n')).split()]
        tweet = [x for x in tweet if len(x) >= 3]
        training.append(' '.join(tweet))
        traininglabel.append('bad')
with open('tempdata/goodtest.txt','r') as f:
    for line in f:
        tweet = [wordclean(x) for x in clean(line.rstrip('\n')).split()]
        test.append(' '.join(tweet))
        testlabel.append('good')
with open('tempdata/badtest.txt','r') as f:
    for line in f:
        tweet = [wordclean(x) for x in clean(line.rstrip('\n')).split()]
        test.append(' '.join(tweet))
        testlabel.append('bad')
vectorizer = TfidfVectorizer(min_df=0.1,max_df=0.9)
train_vect = vectorizer.fit_transform(training)
test_vect = vectorizer.fit_transform(test)
print (train_vect)
print (test_vect)
classifier = tree.DecisionTreeClassifier()
classifier.fit(train_vect.toarray(), traininglabel)
predictions = classifier.predict(test_vect.toarray())
print (classification_report(testlabel, predictions))
cleanupfiles()
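You need to change
test_vect = vectorizer.fit_transform(test)
to
test_vect = vectorizer.transform(test)
The vectorizer's fit() method should only be called on the training data. Calling fit_transform on the test set relearns the vocabulary from the test documents, so the test matrix ends up with a different number of columns than the model was trained on.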
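To make the fix concrete, here is a minimal sketch of the fit-on-train, transform-on-test pattern. The toy sentences and labels are hypothetical stand-ins for the cleaned tweets, and the default TfidfVectorizer settings are used instead of the min_df/max_df values from the question, so treat it as an illustration rather than a drop-in replacement.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data standing in for the cleaned tweets.
training = [
    "great sunny day",
    "love this great phone",
    "terrible rainy day",
    "hate this terrible phone",
]
traininglabel = ["good", "good", "bad", "bad"]
test = ["great phone", "terrible weather today"]

vectorizer = TfidfVectorizer()

# Learn the vocabulary and IDF weights from the training data only.
train_vect = vectorizer.fit_transform(training)

# Reuse the learned vocabulary for the test data; words not seen
# during fitting are simply ignored.
test_vect = vectorizer.transform(test)

# Both matrices now have the same number of columns.
print(train_vect.shape, test_vect.shape)

# For contrast: fitting a vectorizer on the test set builds a different
# vocabulary, which is what caused the n_features mismatch.
refit_vect = TfidfVectorizer().fit_transform(test)
print(refit_vect.shape[1] == train_vect.shape[1])

classifier = DecisionTreeClassifier(random_state=0)
classifier.fit(train_vect.toarray(), traininglabel)
predictions = classifier.predict(test_vect.toarray())
print(predictions)
```

The column count of test_vect matches train_vect because transform() maps the test documents onto the vocabulary learned during fit_transform(training).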