
Get names of the most important features for Logistic Regression after transformation

I want to get the names of the most important features for a Logistic Regression model after the transformation.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, Normalizer

columns_for_encoding = ['a', 'b', 'c', 'd', 'e', 'f',
                        'g', 'h', 'i', 'j', 'k', 'l',
                        'm', 'n', 'o', 'p']

columns_for_scaling = ['aa', 'bb', 'cc', 'dd', 'ee']

transformerVectoriser = ColumnTransformer(
    transformers=[('Vector Cat', OneHotEncoder(handle_unknown="ignore"), columns_for_encoding),
                  ('Normalizer', Normalizer(), columns_for_scaling)],
    remainder='passthrough')

I know that I can do this:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from matplotlib import pyplot

x_train, x_test, y_train, y_test = train_test_split(features, results, test_size=0.2, random_state=42)
x_train = transformerVectoriser.fit_transform(x_train)
x_test = transformerVectoriser.transform(x_test)

clf = LogisticRegression(max_iter=5000, class_weight={1: 3.5, 0: 1})
model = clf.fit(x_train, y_train)

importance = model.coef_[0]

# summarize feature importance
for i, v in enumerate(importance):
    print('Feature: %0d, Score: %.5f' % (i, v))
# plot feature importance
pyplot.bar([x for x in range(len(importance))], importance)
pyplot.show()

But this only gives me feature1, feature2, feature3... and so on, and after the transformation I have around 45k features.

How can I get the list of the most important features (before the transformation)? I want to know which features work best for the model. I have a lot of categorical features with 100+ different categories, so after encoding I have more features than rows in my dataset. That is why I want to find out which features I can exclude from my dataset and which ones work best for my model.

Important: I also have other features that are used but not transformed... that is why I set remainder='passthrough'.

As you have already realized, the whole idea of feature importance is a bit tricky in the case of LogisticRegression. You can read more about it in these posts:

  1. Feature Importance in Logistic Regression for Machine Learning Interpretability
  2. How to Calculate Feature Importance With Python
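For context, both posts revolve around the same basic heuristic: with standardized inputs, the magnitude of a logistic regression coefficient can serve as a rough proxy for the importance of the corresponding feature. A minimal, self-contained sketch of that idea on toy data (the data and the names f1, f2, f3 are purely illustrative, not from your setup):

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# toy numeric data, purely for illustration
rng = np.random.default_rng(0)
X_num = pd.DataFrame(rng.normal(size=(200, 3)), columns=['f1', 'f2', 'f3'])
y = (X_num['f1'] + 0.1 * X_num['f2'] > 0).astype(int)

# standardize so that coefficient magnitudes are comparable across features
X_scaled = StandardScaler().fit_transform(X_num)
lr = LogisticRegression(max_iter=5000).fit(X_scaled, y)

# with standardized inputs, |coefficient| is a rough importance proxy
print(pd.Series(np.abs(lr.coef_[0]), index=X_num.columns).sort_values(ascending=False))

Note that this only ranks the transformed (post-encoding) features, which is exactly the limitation the rest of this answer works around.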

Personally, I didn't find these and other similar posts conclusive, so I'm going to avoid that part in my answer and instead address your concern regarding the splitting of features and the aggregation of the feature importances (assuming they are available for the split features) using a RandomForestClassifier. I will also assume that the importance of a parent feature is the sum of those of its child features.
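To make that assumption concrete: if species is one-hot encoded into species_Adelie, species_Chinstrap, and species_Gentoo, the importance of species is taken to be the sum of the three child importances. A tiny sketch with made-up numbers:

import pandas as pd

# made-up importances for the one-hot encoded child features
child_imp = pd.Series({'species_Adelie': 0.10,
                       'species_Chinstrap': 0.05,
                       'species_Gentoo': 0.20})

# parent importance = sum of its children (split the name on '_')
parent = child_imp.index.str.split('_').str[0]
print(child_imp.groupby(parent).sum())  # species    0.35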

Under these assumptions, we can use the code below to get the importances of the original features. I'm using the Palmer Archipelago (Antarctica) penguin data for illustration.

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, Normalizer

df = pd.read_csv('./data/penguins_size.csv')
df = df.dropna()
# to comply with the assumption later that column names don't contain _
df.columns = [c.replace('_', '-') for c in df.columns]

X = df.iloc[:, :-1]
y = np.asarray(df.iloc[:, 6] == 'MALE').astype(int)

pd.options.display.width = 0
print(X.head())

#   species     island  culmen-length-mm  culmen-depth-mm  flipper-length-mm  body-mass-g
# 0  Adelie  Torgersen              39.1             18.7              181.0       3750.0
# 1  Adelie  Torgersen              39.5             17.4              186.0       3800.0
# 2  Adelie  Torgersen              40.3             18.0              195.0       3250.0
# 4  Adelie  Torgersen              36.7             19.3              193.0       3450.0
# 5  Adelie  Torgersen              39.3             20.6              190.0       3650.0
columns_for_encoding = ['species', 'island']
columns_for_scaling = ['culmen-length-mm', 'culmen-depth-mm']

transformerVectoriser = ColumnTransformer(
    transformers=[('Vector Cat', OneHotEncoder(handle_unknown="ignore"), columns_for_encoding),
                  ('Normalizer', Normalizer(), columns_for_scaling)],
    remainder='passthrough')

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
x_train = transformerVectoriser.fit_transform(x_train)
x_test = transformerVectoriser.transform(x_test)

clf = RandomForestClassifier(max_depth=5)
model = clf.fit(x_train, y_train)

importance = model.feature_importances_

# feature names derived from the encoded columns and their individual importances
# encoded cols
enc_col_out = transformerVectoriser.named_transformers_['Vector Cat'].get_feature_names_out()
enc_col_out_imp = importance[transformerVectoriser.output_indices_['Vector Cat']]
# normalized cols
norm_col = transformerVectoriser.named_transformers_['Normalizer'].feature_names_in_
norm_col_imp = importance[transformerVectoriser.output_indices_['Normalizer']]
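# note: feature_names_in_ and get_feature_names_out() are available in scikit-learn >= 1.0,
# and the original column names are only preserved when the transformer was fitted on a DataFrame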
# remainder cols: these require a quick lookup, as no transformer object exists for this case
rem_cols = []
for (tname, _, cs) in transformerVectoriser.transformers_:
    if tname == 'remainder':
        rem_cols = X.columns[cs]
        break
rem_col_imp = importance[transformerVectoriser.output_indices_['remainder']]
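# note (illustrative, not printed by this code): output_indices_ maps each transformer
# name to a slice of the output columns; with 3 species, 3 islands, 2 normalized columns
# and 2 passthrough columns it would look like
# {'Vector Cat': slice(0, 6), 'Normalizer': slice(6, 8), 'remainder': slice(8, 10)}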

# storing them in a df for easy manipulation
imp_df = pd.DataFrame({'feature': list(enc_col_out) + list(norm_col) + list(rem_cols),
                       'importance': list(enc_col_out_imp) + list(norm_col_imp) + list(rem_col_imp)})

# aggregating, assuming that column names don't contain _ just to keep it simple
imp_df['feature'] = imp_df['feature'].apply(lambda x: x.split('_')[0])
imp_agg = imp_df.groupby(by=['feature']).sum()
print(imp_agg)
print(f'Sum of feature importances: {imp_df["importance"].sum()}')

Output: