NLP: Why use two vectorizers (Bag of Words/TFIDF) in sklearn Pipeline?
I'm trying to solve a text classification problem with SVC on sklearn. I also want to check which vectorizer works best for my data: Bag of Words CountVectorizer() or TF-IDF TfidfVectorizer().
What I've been doing so far is using the two vectorizers separately, one after the other, and then comparing their results.
# Bag of Words (BoW)
from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer = CountVectorizer()
features_train_cv = count_vectorizer.fit_transform(features_train)
# TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vec = TfidfVectorizer()
features_train_tfidf = tfidf_vec.fit_transform(features_train)
# Instantiate SVC
from sklearn.svm import SVC
classifier_linear = SVC(random_state=1, class_weight='balanced', kernel="linear", C=1000)
# Fit SVC with BoW features
classifier_linear.fit(features_train_cv, target_train)
features_test_cv = count_vectorizer.transform(features_test)
target_test_pred_cv = classifier_linear.predict(features_test_cv)
# Confusion matrix: SVC with BoW features
from sklearn.metrics import confusion_matrix
print(confusion_matrix(target_test, target_test_pred_cv))
[[ 689  517]
 [ 697 4890]]
# Fit SVC with TF-IDF features
classifier_linear.fit(features_train_tfidf, target_train)
features_test_tfidf = tfidf_vec.transform(features_test)
target_test_pred_tfidf = classifier_linear.predict(features_test_tfidf)
# Confusion matrix: SVC with TF-IDF features
print(confusion_matrix(target_test, target_test_pred_tfidf))
[[ 701  505]
 [ 673 4914]]
I thought using a Pipeline might make my code look more organized. But I noticed that the Pipeline code suggested in the sklearn tutorial from the module's official page includes two vectorizers: both CountVectorizer() (Bag of Words) and TfidfVectorizer().
# from sklearn official tutorial
from sklearn.pipeline import Pipeline
>>> text_clf = Pipeline([
...     ('vect', CountVectorizer()),
...     ('tfidf', TfidfTransformer()),
...     ('clf', MultinomialNB()),
... ])
My impression was that you only need to choose one vectorizer for your features. Does this mean the data is vectorized twice, once with simple word counts and then again with TF-IDF?
How does this code work?
It's not two vectorizers. It's one vectorizer (CountVectorizer) followed by a transformer (TfidfTransformer). You can use a single vectorizer (TfidfVectorizer) instead.
The TfidfVectorizer docs note that TfidfVectorizer is:
Equivalent to CountVectorizer followed by TfidfTransformer.
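If you want it in one step, here is a minimal sketch, assuming default parameters and reusing your features_train / target_train / features_test variables and SVC settings; pipe_one_step and the equivalence-check variables are just illustrative names, not part of the tutorial:
# Minimal sketch: single-step pipeline with TfidfVectorizer, plus a check that it
# matches CountVectorizer followed by TfidfTransformer (default parameters assumed).
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.svm import SVC

# One vectorizer step instead of CountVectorizer + TfidfTransformer
pipe_one_step = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', SVC(random_state=1, class_weight='balanced', kernel='linear', C=1000)),
])
pipe_one_step.fit(features_train, target_train)
target_test_pred = pipe_one_step.predict(features_test)

# Equivalence check: both routes yield the same sparse TF-IDF matrix
counts = CountVectorizer().fit_transform(features_train)
tfidf_two_step = TfidfTransformer().fit_transform(counts)
tfidf_one_step = TfidfVectorizer().fit_transform(features_train)
print((tfidf_two_step != tfidf_one_step).nnz)  # 0 -> identical features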