scikit-learn: how to use two different datasets as train and test sets
I am trying to use two separate datasets as the training set and the test set. With the following code I get:
  File "main.py", line 84, in main_test
    X2 = tf_transformer.transform(word_counts2)
  File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 1020, in transform
    n_features, expected_n_features))
ValueError: Input has n_features=1293 while the model has been trained with n_features=1625
def main_test(path=None):
    dir_path = path or 'dataset'
    files = sklearn.datasets.load_files(dir_path)
    util.refine_all_emails(files.data)
    word_counts = util.bagOfWords(files.data)
    tf_transformer = sklearn.feature_extraction.text.TfidfTransformer(use_idf=True)
    tf_transformer.fit(word_counts)
    X = tf_transformer.transform(word_counts)

    dir_path = 'testset'
    files2 = sklearn.datasets.load_files(dir_path)
    util.refine_all_emails(files2.data)
    word_counts2 = util.bagOfWords(files2.data)
    # tf_transformer = sklearn.feature_extraction.text.TfidfTransformer(use_idf=True)
    # tf_transformer.fit(word_counts2)
    X2 = tf_transformer.transform(word_counts2)

    clf = sklearn.svm.LinearSVC()
    test_classifier(X, files.target, clf, X2, files2.target, test_size=0.2, y_names=files.target_names, confusion=False)

def test_classifier(X, y, clf, X2, y2, test_size=0.4, y_names=None, confusion=False):
    X_train, X_test, y_train, y_test = X, X2, y, y2
    clf.fit(X_train, y_train)
    # clf.fit(X_test, y_test)
    y_predicted = clf.predict(X_test)
    print colored('Classification report:', 'magenta', attrs=['bold'])
    print sklearn.metrics.classification_report(y_test, y_predicted, target_names=y_names)
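That is because when you call
word_counts2 = util.bagOfWords(files2.data)
the counts are presumably built from the test set's own vocabulary. That vocabulary includes words the TF-IDF transformer never saw in the training set, so it has no inverse document frequencies for them, and the number of columns no longer matches (1293 instead of the 1625 seen during fitting).
You need to count only the words that appear in the training set, in the same column order as the training matrix. CountVectorizer may help with that.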
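A minimal sketch of that suggestion, assuming CountVectorizer replaces the asker's util.bagOfWords (whose implementation is not shown). The toy documents and labels are hypothetical stand-ins for files.data, files.target, files2.data and files2.target. The vectorizer is fitted on the training documents only and then reused to transform the test documents, so both matrices share the same columns:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for files.data / files2.data.
train_docs = [
    "cheap pills buy now",
    "meeting agenda for monday",
    "buy cheap watches now",
    "monday project meeting notes",
]
train_labels = [1, 0, 1, 0]
test_docs = ["buy pills online today", "agenda for the project meeting"]
test_labels = [1, 0]

# Learn the vocabulary from the training set only.
vectorizer = CountVectorizer()
train_counts = vectorizer.fit_transform(train_docs)

# Reuse the fitted vocabulary: words unseen during training are ignored,
# so the test matrix has the same number of columns as the training matrix.
test_counts = vectorizer.transform(test_docs)

# Fit the IDF weights on the training counts, then apply them to both sets.
tf_transformer = TfidfTransformer(use_idf=True)
X_train = tf_transformer.fit_transform(train_counts)
X_test = tf_transformer.transform(test_counts)

clf = LinearSVC()
clf.fit(X_train, train_labels)
y_predicted = clf.predict(X_test)

print(X_train.shape, X_test.shape)
```

The point is that fit (or fit_transform) is called only on training data, while the test data goes through transform alone. The same rule applies to the TF-IDF step and to the classifier.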