Include feature extraction in an sklearn pipeline
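For a text classification project, I built a pipeline for feature selection and the classifier. My question is whether the feature extraction module can also be included in the pipeline, and if so, how. I looked into it, but what I found doesn't seem to fit my current code.
This is what I have now: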
# feature_extraction module.
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.feature_extraction import DictVectorizer
import numpy as np
vec = DictVectorizer()
X = vec.fit_transform(instances)
scaler = StandardScaler(with_mean=False) # we use cross validation, no train/test set
X_scaled = scaler.fit_transform(X) # To make sure everything is on the same scale
enc = LabelEncoder()
y = enc.fit_transform(labels)
# Feature selection and classification pipeline
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn import linear_model
from sklearn.pipeline import Pipeline
feat_sel = SelectKBest(mutual_info_classif, k=200)
clf = linear_model.LogisticRegression()
pipe = Pipeline([('mutual_info', feat_sel), ('logistregress', clf)])
y_pred = model_selection.cross_val_predict(pipe, X_scaled, y, cv=10)
How can I put everything from the DictVectorizer through the LabelEncoder into the pipeline?
Here's how you would do it. Assuming instances is a list of dict-like objects, as specified in the DictVectorizer API, just build your pipeline like this:
pipe = Pipeline([('vectorizer', DictVectorizer()),
                 ('scaler', StandardScaler(with_mean=False)),
                 ('mutual_info', feat_sel),
                 ('logistregress', clf)])
Then, to get predictions, call cross_val_predict, passing instances as X:
y_pred = model_selection.cross_val_predict(pipe, instances, y, cv=10)
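For reference, here is a minimal end-to-end sketch of the pipeline above. The toy instances and labels are made up for illustration, and k=2 and cv=3 are chosen only so the example fits such a small dataset (the question uses k=200 and cv=10). Note that the LabelEncoder stays outside the pipeline: pipeline steps only transform X, not y, so the labels are encoded once up front (scikit-learn classifiers also accept string labels directly, so this step is optional). Depending on your scikit-learn version, mutual_info_classif may emit a warning about continuous values after scaling; the code should still run.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical toy data: one feature dict (token counts) per document.
instances = [
    {"good": 2, "plot": 1}, {"good": 1, "great": 1}, {"great": 2, "plot": 1},
    {"good": 3}, {"great": 1, "acting": 1}, {"good": 1, "acting": 1},
    {"bad": 2, "plot": 1}, {"bad": 1, "awful": 1}, {"awful": 2, "plot": 1},
    {"bad": 3}, {"awful": 1, "acting": 1}, {"bad": 1, "acting": 1},
]
labels = ["pos"] * 6 + ["neg"] * 6

# Labels are encoded outside the pipeline, because pipeline steps only see X.
enc = LabelEncoder()
y = enc.fit_transform(labels)

pipe = Pipeline([
    ("vectorizer", DictVectorizer()),              # dicts -> sparse matrix
    ("scaler", StandardScaler(with_mean=False)),   # with_mean=False keeps sparsity
    ("mutual_info", SelectKBest(mutual_info_classif, k=2)),  # k=2 only for toy data
    ("logistregress", LogisticRegression()),
])

# Every step is refit on the training part of each fold, so the vectorizer,
# scaler and feature selector never see the held-out documents.
y_pred = cross_val_predict(pipe, instances, y, cv=3)

# Map the integer predictions back to the original label strings.
predicted_labels = enc.inverse_transform(y_pred)
print(list(predicted_labels))
```

A side benefit of this layout compared with the code in the question is that scaling is no longer fitted on the full dataset before cross-validation, which avoids leaking information from the held-out folds.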