Using the best-found PCA estimator as the estimator in RFECV

Question

This works (mostly taken from the sklearn demo example):

print(__doc__)


# Code source: Gaël Varoquaux
# Modified for documentation by Jaques Grobler
# License: BSD 3 clause


import numpy as np
import matplotlib.pyplot as plt

from sklearn import linear_model, decomposition, datasets
from sklearn.linear_model import LinearRegression  # used below as LinearRegression()
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from scipy.stats import uniform

lregress = LinearRegression()

pca = decomposition.PCA()
pipe = Pipeline(steps=[('pca', pca), ('regress', lregress)])


# Plot the PCA spectrum
pca.fit(data_num)

plt.figure(1, figsize=(16, 9))
plt.clf()
plt.axes([.2, .2, .7, .7])
plt.plot(pca.explained_variance_, linewidth=2)
plt.axis('tight')
plt.xlabel('n_components')
plt.ylabel('explained_variance_')

# Prediction
n_components = uniform.rvs(loc=1, scale=data_num.shape[1], size=50,
                           random_state=42).astype(int)

# Parameters of pipelines can be set using '__' separated parameter names:
estimator_pca = GridSearchCV(pipe,
                             dict(pca__n_components=n_components))
estimator_pca.fit(data_num, data_labels)

plt.axvline(estimator_pca.best_estimator_.named_steps['pca'].n_components,
            linestyle=':', label='n_components chosen ' +
            str(estimator_pca.best_estimator_.named_steps['pca'].n_components))
plt.legend(prop=dict(size=12))


plt.plot(np.cumsum(pca.explained_variance_ratio_), linewidth=1)

plt.show()

This also works:

from sklearn.feature_selection import RFECV


estimator = LinearRegression()
selector = RFECV(estimator, step=1, cv=5, scoring='explained_variance')
selector = selector.fit(data_num_pd, data_labels)
print("Selected number of features : %d" % selector.n_features_)

plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score")
plt.plot(range(1, len(selector.grid_scores_) + 1), selector.grid_scores_)
plt.show()

But this gives me the error 'RuntimeError: The classifier does not expose "coef_" or "feature_importances_" attributes' on the line "selector1 = selector1.fit":

pca_est = estimator_pca.best_estimator_

selector1 = RFECV(pca_est, step=1, cv=5, scoring='explained_variance')
selector1 = selector1.fit(data_num_pd, data_labels)

print("Selected number of features : %d" % selector1.n_features_)

plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score")
plt.plot(range(1, len(selector1.grid_scores_) + 1), selector1.grid_scores_)
plt.show()

How can I use the best PCA estimator I found as the estimator in RFECV?

Answer 1

This is a known issue in the Pipeline design. See the GitHub page here:

Accessing fitted attributes:

Moreover, some fitted attributes are used by meta-estimators; AdaBoostClassifier assumes its sub-estimator has a classes_ attribute after fitting, which means that presently Pipeline cannot be used as the sub-estimator of AdaBoostClassifier.

Either meta-estimators such as AdaBoostClassifier need to be configurable in how they access this attribute, or meta-estimators such as Pipeline need to make some fitted attributes of sub-estimators accessible.

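The same holds for other attributes such as coef_ and feature_importances_. They belong to the last estimator in the pipeline, so the Pipeline object itself does not expose them.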
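To make this concrete, here is a minimal sketch on synthetic data (the random X and y are assumptions standing in for the question's data_num and data_labels, which are not shown). It checks where coef_ lives after fitting a PCA plus LinearRegression pipeline:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (assumption: the real data_num is numeric and 2-D).
rng = np.random.RandomState(0)
X = rng.normal(size=(60, 5))
y = X @ np.array([1.0, 2.0, 0.0, 0.0, -1.0]) + rng.normal(scale=0.1, size=60)

pipe = Pipeline(steps=[('pca', PCA(n_components=3)),
                       ('regress', LinearRegression())])
pipe.fit(X, y)

# The fitted final step carries coef_; the Pipeline object does not.
final_has_coef = hasattr(pipe.named_steps['regress'], 'coef_')
pipe_has_coef = hasattr(pipe, 'coef_')

# The regression was fitted on the PCA output, so it has one coefficient
# per principal component, not one per original input column.
coef_shape = pipe.named_steps['regress'].coef_.shape

print(final_has_coef, pipe_has_coef, coef_shape)
```

Note that the coefficients refer to the 3 principal components rather than the 5 original columns, which is part of why this pipeline is an awkward fit for RFECV.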

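For now, you can follow the last paragraph of the page quoted above and work around it by exposing these attributes on the pipeline yourself: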

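from sklearn.pipeline import Pipeline


class Mypipeline(Pipeline):
    # Forward the fitted attributes of the last step so that
    # meta-estimators such as RFECV can find them on the pipeline.
    @property
    def coef_(self):
        return self._final_estimator.coef_

    @property
    def feature_importances_(self):
        return self._final_estimator.feature_importances_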

Then use this new pipeline class in your code instead of the original Pipeline.
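As a rough illustration of a case where the subclass does help, the sketch below wraps a StandardScaler and a LinearRegression, so the number of columns is unchanged before the final step. The synthetic data and the step names 'scale' and 'regress' are assumptions, not part of the question:

```python
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class Mypipeline(Pipeline):
    # Forward coef_ from the last step so RFECV can rank the features.
    @property
    def coef_(self):
        return self._final_estimator.coef_


# Synthetic data (assumption): only the first two columns drive the target.
rng = np.random.RandomState(0)
X = rng.normal(size=(80, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=80)

# No step changes the number of columns, so the final estimator has
# exactly one coefficient per feature that RFECV passes in.
scaled_regress = Mypipeline(steps=[('scale', StandardScaler()),
                                   ('regress', LinearRegression())])

selector = RFECV(scaled_regress, step=1, cv=5, scoring='explained_variance')
selector.fit(X, y)

print("Selected number of features : %d" % selector.n_features_)
print(selector.support_)
```

Recent scikit-learn releases also give RFECV an importance_getter parameter that accepts an attribute path such as 'named_steps.regress.coef_', which may remove the need for a subclass if your installed version supports it.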

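This should work in most cases, but not in yours. You are using PCA inside the pipeline for feature reduction, yet you want RFECV to do feature selection on top of it. That does not look like a good combination to me.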

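RFECV will keep reducing the number of features it uses, but the n_components of the best PCA chosen by the grid search above stays fixed. As soon as the number of features drops below n_components, it will throw an error again. There is nothing you can do in that case.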
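The failure described here can be reproduced in isolation. This is a small sketch on random data, where n_components=4 is an assumed value standing in for whatever the grid search selected:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.normal(size=(50, 6))

# Fine: 6 input features are enough for 4 components.
PCA(n_components=4).fit(X)

# Simulate RFECV having eliminated features until only 3 columns remain.
X_reduced = X[:, :3]

try:
    PCA(n_components=4).fit(X_reduced)
    error_message = None
except ValueError as exc:
    # PCA rejects n_components larger than min(n_samples, n_features).
    error_message = str(exc)

print(error_message)
```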

So I suggest you reconsider your use case and your code.

regression

feature-extraction

feature-selection

scikit-learn
