How to ensure disjoint set of features when using Feature Union

Question

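I am trying to learn how to use some of the helper functionality in sklearn, but I am struggling to understand how to use FeatureUnion.

Part of the documentation states this: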

(A FeatureUnion has no way of checking whether two transformers might produce identical features. It only produces a union when the feature sets are disjoint, and making sure they are is the caller’s responsibility.)

However, the example on the Iris dataset shows this:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

iris = load_iris()
X, y = iris.data, iris.target

# This dataset is way too high-dimensional. Better do PCA:
pca = PCA(n_components=2)

# Maybe some original features were good, too?
selection = SelectKBest(k=1)

# Build estimator from PCA and univariate selection:
combined_features = FeatureUnion([("pca", pca), ("univ_select", selection)])

# Use combined features to transform dataset:
X_features = combined_features.fit(X, y).transform(X)

How can I ensure that pca and SelectKBest do not select the same features? In other words, how does the user make sure that the two selections are disjoint?

http://scikit-learn.org/dev/modules/pipeline.html#feature-union

http://scikit-learn.org/stable/auto_examples/feature_stacker.html#example-feature-stacker-py

Answer 1

I think you have pretty much answered your own question with that sentence from the documentation:

(A FeatureUnion has no way of checking whether two transformers might produce identical features. It only produces a union when the feature sets are disjoint, and making sure they are is the caller’s responsibility.)

FeatureUnion does not guarantee that the features are distinct.

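In the Iris example, it is possible (though unlikely) that PCA and the feature selection step generate identical features. In that case, the output of FeatureUnion simply contains the same feature twice.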
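As a minimal sketch of what such a duplicate looks like, the deliberately contrived union below stacks two identical SelectKBest(k=1) selectors (the step names select_a and select_b are made up for illustration). Both pick the same column, and FeatureUnion concatenates them without complaint:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

X, y = load_iris(return_X_y=True)

# Two identical selectors: a contrived case that is guaranteed to
# produce the same feature twice. The step names are hypothetical.
union = FeatureUnion([
    ("select_a", SelectKBest(k=1)),
    ("select_b", SelectKBest(k=1)),
])
X_features = union.fit_transform(X, y)

# FeatureUnion stacks the outputs side by side: one column per selector.
print(X_features.shape)

# Check whether the two output columns are element-wise identical.
duplicated = np.array_equal(X_features[:, 0], X_features[:, 1])
print(duplicated)
```

The union raises no error or warning here, which is what the quoted documentation sentence is pointing at.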

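This is usually not a big deal, but if you can avoid it, doing so is probably cleaner. For example, a random forest model would be biased toward a feature that appears several times, because that feature has a higher probability of being picked as a candidate when splitting a node.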
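To put a rough number on the random forest remark, here is a small back-of-the-envelope calculation. It assumes a simplified model in which each split draws a fixed number of candidate columns uniformly without replacement; this is not scikit-learn's actual implementation, and the function name and the column counts are made up for illustration:

```python
from math import comb

def prob_feature_is_candidate(n_columns, n_copies, n_candidates):
    """Probability that at least one of `n_copies` identical columns is among
    `n_candidates` columns drawn uniformly without replacement from
    `n_columns` columns (simplified model, hypothetical helper)."""
    none_drawn = comb(n_columns - n_copies, n_candidates) / comb(n_columns, n_candidates)
    return 1 - none_drawn

# 4 columns, the feature appears once, 2 candidates per split.
p_single = prob_feature_is_candidate(4, 1, 2)

# Same feature duplicated: 5 columns, 2 of them identical, 2 candidates per split.
p_duplicated = prob_feature_is_candidate(5, 2, 2)

print(f"{p_single:.2f}")      # 0.50
print(f"{p_duplicated:.2f}")  # 0.70
```

Under these assumptions the duplicated feature is offered to a split more often than it would be if it appeared once.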

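To be clear, I do not think there is much you can do, apart from avoiding combining feature extraction procedures that will obviously create duplicate features in a FeatureUnion.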
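If you still want a sanity check after the fact, one option is to inspect the transformed matrix yourself. The sketch below is only one possible approach, not something FeatureUnion provides: duplicate_column_pairs is a hypothetical helper, and the 0.999 correlation threshold is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

def duplicate_column_pairs(X, threshold=0.999):
    """Hypothetical helper: return index pairs (i, j), i < j, of columns whose
    absolute Pearson correlation exceeds `threshold`.
    Constant columns yield NaN correlations and are never reported."""
    corr = np.corrcoef(X, rowvar=False)
    n = corr.shape[0]
    return [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if abs(corr[i, j]) > threshold
    ]

X, y = load_iris(return_X_y=True)
union = FeatureUnion([("pca", PCA(n_components=2)), ("univ_select", SelectKBest(k=1))])
X_features = union.fit_transform(X, y)

# Inspect the union's output for (near-)duplicate columns.
print(duplicate_column_pairs(X_features))

# Contrived check: column 2 is an exact copy of column 0.
X_with_copy = np.column_stack([X[:, 0], X[:, 1], X[:, 0]])
print(duplicate_column_pairs(X_with_copy))  # [(0, 2)]
```

A correlation check also catches columns that are scaled copies of each other, not just exact duplicates. Whether that is what you want depends on the model you feed the features into.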


python

scikit-learn