Scikit-learn SequentialFeatureSelector: "Input contains NaN, infinity or a value too large for dtype('float64')", even with a pipeline

Question

I'm trying to use SequentialFeatureSelector, and for the estimator parameter I'm passing it a pipeline that includes a step that imputes missing values:

model = Pipeline(steps=[('preprocessing',
                 ColumnTransformer(transformers=[('pipeline-1',
                                                  Pipeline(steps=[('imputing',
                                                                   SimpleImputer(fill_value=-1,
                                                                                 strategy='constant')),
                                                                  ('preprocessing',
                                                                   StandardScaler())]),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x1300013d0>),
                                                 ('pipeline-2',
                                                  Pipeline(steps=[('imputing',
                                                                   SimpleImputer(fill_value='missing',
                                                                                 strategy='constant')),
                                                                  ('encoding',
                                                                   OrdinalEncoder(handle_unknown='ignore'))]),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x1300015b0>)])),
                ('model',
                 LGBMClassifier(class_weight='balanced', random_state=1,
                                reg_lambda=0.1))])

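Nevertheless, when I pass it to the selector it raises an error, which makes no sense to me, because I have already fitted and evaluated this model and it runs fine: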

fselector = SequentialFeatureSelector(estimator=model, scoring="roc_auc", cv=3, n_jobs=-1).fit(X, target)




 _assert_all_finite(X, allow_nan, msg_dtype)
        101                 not allow_nan and not np.isfinite(X).all()):
        102             type_err = 'infinity' if allow_nan else 'NaN, infinity'
    --> 103             raise ValueError(
        104                     msg_err.format
        105                     (type_err,
    
    ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Edit:

Reproducible example:

import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

X, y = load_iris(return_X_y=True)
X[:10, 0] = np.nan  # inject missing values into the first feature

clf = Pipeline([("preprocessing", SimpleImputer(missing_values=np.nan)),
                ("model", LogisticRegression(random_state=1))])

SequentialFeatureSelector(estimator=clf,
                          scoring="accuracy",
                          cv=3).fit(X, y)

It raises the same error, even though clf itself can be fitted without any problem.

Answer 1

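The scikit-learn documentation does not say that SequentialFeatureSelector works with Pipeline objects. It only states that the class accepts an unfitted estimator. Given that, you can remove the classifier from the pipeline, preprocess X, and then pass the preprocessed X together with an unfitted classifier to the feature selector, as in the example below.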

import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MaxAbsScaler


X, y = load_iris(return_X_y=True)
X[:10, 0] = np.nan  # inject missing values into the first feature

pipe = Pipeline([("preprocessing", SimpleImputer(missing_values=np.nan)),
                 ("scaler", MaxAbsScaler())])


# Preprocess your data
X = pipe.fit_transform(X)

# Run the SequentialFeatureSelector
sfs = SequentialFeatureSelector(estimator=LogisticRegression(),
                                scoring="accuracy",
                                cv=3).fit(X, y)

# Check which features were selected and transform X
print(sfs.get_support())
X = sfs.transform(X)
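
The traceback in the question ends in _assert_all_finite, which is part of scikit-learn's input validation rather than of the pipeline. The following is a minimal sketch of that distinction, assuming the selector validates X before the pipeline gets a chance to impute it (the traceback suggests this, but it may depend on the scikit-learn version). The variable names validation_failed and imputed_is_finite, and the max_iter setting, are only for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.utils import check_array

X, y = load_iris(return_X_y=True)
X[:10, 0] = np.nan  # inject missing values into the first feature

clf = Pipeline([
    ("preprocessing", SimpleImputer()),
    ("model", LogisticRegression(random_state=1, max_iter=1000)),
])

# The pipeline imputes before the model sees the data, so fitting works.
clf.fit(X, y)

# Default input validation, run directly on the raw X, rejects NaN.
try:
    check_array(X)
    validation_failed = False
except ValueError as exc:
    validation_failed = True
    print(exc)

# After imputation the same check would pass, which is why preprocessing
# X up front avoids the error.
X_imputed = SimpleImputer().fit_transform(X)
imputed_is_finite = bool(np.isfinite(X_imputed).all())
print(validation_failed, imputed_is_finite)
```

If the check on the raw X fails while the imputed X is finite, the error comes from validation of the raw input and not from the estimator inside the pipeline.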

Answer 2

You can use SequentialFeatureSelector from the mlxtend package instead: https://rasbt.github.io/mlxtend/

from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
import numpy as np

X, y = load_iris(return_X_y=True)
X[:10, 0] = np.nan  # inject missing values into the first feature

clf = Pipeline([
    ("preprocessing", SimpleImputer(missing_values=np.nan)),
    ("model", LogisticRegression(random_state=1))
])

sfs = SequentialFeatureSelector(estimator = clf, 
                                forward = True, 
                                k_features = 'best', 
                                scoring = "accuracy", 
                                cv = 3, n_jobs=-1).fit(X, y)
sfs.k_feature_idx_

>>> (0, 1, 2, 3)
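
If adding mlxtend is not an option, the same idea (score the whole pipeline on each candidate subset, so imputation happens inside every cross-validation fold) can be sketched by hand with cross_val_score. This is only an illustration of greedy forward selection, not how either library implements it. forward_select is a hypothetical helper, and n_features=2 and max_iter=1000 are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)
X[:10, 0] = np.nan  # inject missing values into the first feature

clf = Pipeline([
    ("preprocessing", SimpleImputer()),
    ("model", LogisticRegression(random_state=1, max_iter=1000)),
])


def forward_select(estimator, X, y, n_features, cv=3, scoring="accuracy"):
    """Hypothetical helper: greedy forward selection scored by cross-validation."""
    selected = []
    remaining = list(range(X.shape[1]))
    while len(selected) < n_features:
        scores = {}
        for j in remaining:
            cols = selected + [j]
            # The raw columns (NaN included) go straight to the pipeline,
            # which imputes them separately within each training fold.
            scores[j] = cross_val_score(
                estimator, X[:, cols], y, cv=cv, scoring=scoring
            ).mean()
        best = max(scores, key=scores.get)  # candidate with the highest mean score
        selected.append(best)
        remaining.remove(best)
    return sorted(selected)


chosen = forward_select(clf, X, y, n_features=2)
print(chosen)
```

The returned column indices can then be used to slice X, for example X[:, chosen], before fitting the final pipeline.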


python

pipeline

scikit-learn