scikit-learn: Custom feature selection within each CV fold

I want to run a randomized search over a large hyperparameter grid. One of the hyperparameters I want to optimize is the feature selection method. scikit-learn offers some very useful functionality for this, such as the RFECV class, but it is not compatible with all models, since some models do not expose a coef_ or feature_importances_ attribute. So I want to compare RFECV against univariate feature selection. Specifically, I want to keep every feature whose association with my dependent variable is statistically significant at uncorrected p < 0.05 in a univariate analysis. However, my modeling strategy is fairly complex, so applying a simple univariate statistical test with one of the existing scikit-learn classes such as SelectKBest or SelectFdr is not an option. At the same time, I am wary of simply precomputing the significant univariate associations on the entire dataset, since that seems to mix training and test data.
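
(For context, this is the standard leakage-free pattern that the built-in classes support, and which my modeling strategy unfortunately rules out: put the selector inside a Pipeline so the univariate test is refit on each fold's training data only. A minimal sketch with placeholder data from make_classification:)

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFdr, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=100, n_features=20, random_state=0)

# Because the selector is a pipeline step, each CV fold refits the
# univariate test on that fold's training data only -- no leakage.
pipe = Pipeline([
    ('select', SelectFdr(f_classif, alpha=0.05)),
    ('model', LogisticRegression(solver='lbfgs')),
])
print(cross_val_score(pipe, X, y, cv=5))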

I think the simplest way around this is to precompute the significant univariate associations on the training subset of each cross-validation split, and then implement a custom feature selection function that reads these from a text file. I learned from this question that I can create a custom feature selection object that takes a cross-validation object in its constructor:

class ExternalSelector():
    """
    Univariate feature selection by reading pre-calculated results
    for each CV split.
    """

    def __init__(self, cv):
        self.cv = cv
        self.feature_subset = None

    def transform(self, X, y=None, **kwargs):
        split_idx = 0
        for train_idxs, test_idxs in self.cv:
            # read the file of pre-calculated features for this split

            # subset the columns of X accordingly

            split_idx = split_idx + 1
        return X  # placeholder: should return only the selected columns

    def fit(self, X, y=None):
        return self

    def get_params(self, deep=True):
        return {"cv": self.cv}

...but looking at sklearn's univariate feature selection source code, I can't work out how, or even whether it is possible, to return a different subset of X for each split.

How can I implement a custom feature selection function that reads a different list of features for each cross-validation split?

Take a look at GenericUnivariateSelect; it seems well suited to your case.

Here is an example of how to use it within CV:

import numpy as np
from sklearn.feature_selection import GenericUnivariateSelect, f_classif
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


X = np.array([[1, 1, 0],
              [1, 0, 0],
              [0, 1, 1],
              [0, 0, 1],
              [0, 0, 0]])

Y = np.array([1, 1, 0, 0, 1])

cv = KFold(5).split(X)  # random_state only applies when shuffle=True
# keep features whose FWE-corrected p-value is below 0.05
feature_selector = GenericUnivariateSelect(f_classif, mode='fwe', param=0.05)
model = LogisticRegression(solver='lbfgs')

pipe = Pipeline([
    ('feature', feature_selector),
    ('logreg', model)
])

for i, (train_idx, test_idx) in enumerate(cv):
    pipe.fit(X[train_idx], Y[train_idx])
    score = pipe.score(X[test_idx], Y[test_idx])
    print("Feature selected for fold {} is {}".format(i, pipe.named_steps['feature'].get_support()))

Output:

# Feature selected for fold 0 is [False False  True]
# Feature selected for fold 1 is [False False  True]
# Feature selected for fold 2 is [False False  True]
# Feature selected for fold 3 is [False False  True]
# Feature selected for fold 4 is [ True False  True]

You can replace f_classif with your own function, as long as it returns scores and pvalues for all features.
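
For instance, a minimal sketch of such a function, using scipy's pearsonr purely as a stand-in for whatever univariate test you actually need:

import numpy as np
from scipy.stats import pearsonr
from sklearn.feature_selection import GenericUnivariateSelect

def univariate_scores(X, y):
    # Compute one (score, p-value) pair per column of X.
    # Any univariate test can be substituted here.
    results = [pearsonr(X[:, j], y) for j in range(X.shape[1])]
    scores = np.array([r[0] for r in results])
    pvalues = np.array([r[1] for r in results])
    return scores, pvalues

selector = GenericUnivariateSelect(univariate_scores, mode='fwe', param=0.05)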

I eventually found a solution that won't win any style points, but which works for my application. I kept the splits fixed and precomputed the significant univariate results in a different language (R). I then wrote a custom feature selection class that infers the index of the current split from the observation (row) indices of the current training split (X), given the entire dataset (Xall). The index of the current split is then used to read the precomputed features for that particular split from a file.

class PrecalculatedSelector():
    """
    Univariate feature selection by reading pre-calculated results
    for each split. 
    """

    def __init__(self, cv, Xall, yall):
        self.cv = cv
        self.Xall = Xall
        self.yall = yall
        self.features = None

    def transform(self, X, y=None, **kwargs):
        # X is a pandas DataFrame, so this selects columns by name
        return X[self.features]

    def fit(self, X, y=None):
        # infer split index from sample indices
        samples = list(X.index)
        sample_idxs = [idx for idx, item in enumerate(self.Xall.index) if \
                       item in samples]
        counter = 0
        split_idx = -1
        for train_idxs, test_idxs in self.cv.split(self.Xall, self.yall):
            counter += 1
            if list(train_idxs) == sample_idxs:
                split_idx = counter
                break

        # read univariate results from file
        # (split files are numbered from 1, matching `counter` above)
        feature_dir = ...
        feature_file = "{}/split-{}.csv".format(feature_dir, split_idx)
        with open(feature_file, 'r') as f:
            self.features = [line.strip() for line in f.readlines()]

        return self
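
For reference, here is roughly how I drive it (a sketch: Xall is the full pandas DataFrame and yall the full Series, and LogisticRegression stands in for my actual, more complex model):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline

# The same cv object is shared by the selector and the loop below,
# so the split index inferred inside fit() lines up with the actual split.
cv = KFold(5)
pipe = Pipeline([
    ('select', PrecalculatedSelector(cv, Xall, yall)),
    ('model', LogisticRegression(solver='lbfgs')),
])

for train_idx, test_idx in cv.split(Xall, yall):
    pipe.fit(Xall.iloc[train_idx], yall.iloc[train_idx])
    print(pipe.score(Xall.iloc[test_idx], yall.iloc[test_idx]))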