Sklearn Pipelines: ValueError - Expected number of features
I created a pipeline that basically loops over models and scalers and performs recursive feature elimination (RFE), as follows:
def train_models(models, scalers, X_train, y_train, X_val, y_val):
    best_results = {'f1_score': 0}
    for model in models:
        for scaler in scalers:
            for n_features in list(range(
                len(X_train.columns),
                int(len(X_train.columns)/2),
                -10
            )):
                rfe = RFE(
                    estimator=model,
                    n_features_to_select=n_features,
                    step=10
                )
                pipe = Pipeline([
                    ('scaler', scaler),
                    ('selector', rfe),
                    ('model', model)
                ])
                pipe.fit(X_train, y_train)
                y_pred = pipe.predict(X_val)
                results = evaluate(y_val, y_pred)  # Returns a dictionary of values
                results['pipeline'] = pipe
                results['y_pred'] = y_pred
                if results['f1_score'] > best_results['f1_score']:
                    best_results = results
                    print("Best F1: {}".format(best_results['f1_score']))
    return best_results
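The pipeline works fine inside the function: it predicts the results and scores them correctly.
However, when I call pipeline.predict() outside the function, e.g.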
best_result = train_models(models, scalers, X_train, y_train, X_val, y_val)
pipeline = best_result['pipeline']
pipeline.predict(X_val)
I get a ValueError about the expected number of features.
This is what pipeline looks like:
Pipeline(steps=[('scaler', StandardScaler()),
                ('selector',
                 RFE(estimator=LogisticRegression(C=1, max_iter=1000,
                                                  penalty='l1',
                                                  solver='liblinear'),
                     n_features_to_select=78, step=10)),
                ('model',
                 LogisticRegression(C=1, max_iter=1000, penalty='l1',
                                    solver='liblinear'))])
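I guess the model in the pipeline expects 48 features instead of 78, but I don't understand where the number 48 comes from, since n_features_to_select is set to 78 in the preceding RFE step!
Any help would be appreciated!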
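I don't have your data, but after some arithmetic and guessing based on what you shared, 48 appears to be the last n_features value your nested loops try. That makes me suspect the culprit is a shallow copy: every pipeline built in the loop holds a reference to the same model object rather than its own copy, so each later fit also refits the final step of the pipelines stored earlier. I suggest changing the following: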
pipe = Pipeline([
    ('scaler', scaler),
    ('selector', rfe),
    ('model', model)
])
to
pipe = Pipeline([
    ('scaler', scaler),
    ('selector', rfe),
    ('model', copy.deepcopy(model))
])
and then try again (after adding import copy first, of course).
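The "arithmetic and guessing" behind the number 48 can be reproduced with the same range() expression the question uses. The real column count of X_train is not given, so this sketch simply searches a hypothetical range of column counts for those where the loop visits 78 and ends on 48; the helper name tried_feature_counts is made up for illustration.

```python
def tried_feature_counts(n_columns, step=10):
    """Feature counts visited by the innermost loop in the question."""
    return list(range(n_columns, int(n_columns / 2), -step))


# Hypothetical search range: the question does not state the column count.
candidates = [
    n for n in range(1, 201)
    if 78 in tried_feature_counts(n) and tried_feature_counts(n)[-1:] == [48]
]

print(candidates)
for n in candidates:
    print(n, tried_feature_counts(n))
```

Under these assumptions only 78 or 88 columns fit, and in both cases 48 is the final value the loop tries, which is consistent with the model having been refitted last on 48 features.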
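To see why copying the final estimator matters, here is a minimal sketch on synthetic data. It is not the asker's setup: the dataset from make_classification, the feature counts [20, 15, 10], step=5, and the helper names fit_pipelines and final_step_widths are all assumptions chosen to keep the example small. It relies on two scikit-learn behaviours: RFE fits a clone of its estimator, while Pipeline fits its final step in place.

```python
import copy

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data: 200 rows, 20 features.
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)


def fit_pipelines(model, feature_counts, copy_model):
    """Fit one pipeline per feature count, mirroring the loop in the question."""
    pipes = {}
    for n_features in feature_counts:
        rfe = RFE(estimator=model, n_features_to_select=n_features, step=5)
        # The only difference between the two variants is this line.
        final_model = copy.deepcopy(model) if copy_model else model
        pipe = Pipeline([
            ('scaler', StandardScaler()),
            ('selector', rfe),
            ('model', final_model),
        ])
        pipe.fit(X, y)
        pipes[n_features] = pipe
    return pipes


def final_step_widths(pipes):
    """Number of input features each pipeline's final model was last fitted on."""
    return {n: pipe.named_steps['model'].coef_.shape[1]
            for n, pipe in pipes.items()}


counts = [20, 15, 10]
shared = fit_pipelines(LogisticRegression(max_iter=1000), counts, copy_model=False)
copied = fit_pipelines(LogisticRegression(max_iter=1000), counts, copy_model=True)

print(final_step_widths(shared))  # every pipeline reports the last count tried
print(final_step_widths(copied))  # each pipeline keeps its own count

try:
    shared[20].predict(X)  # selector passes 20 features, model expects 10
except ValueError as err:
    print("shared model:", err)

print(copied[20].predict(X).shape)
```

With the shared model, all three stored pipelines end up with a final step fitted on 10 features, so the earlier ones fail at predict time. As an alternative to copy.deepcopy, sklearn.base.clone(model) also returns a separate, unfitted estimator with the same parameters.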