
Get names of the most important features for Logistic Regression after transformation

I want to get the names of the most important features for a Logistic Regression model after the transformation.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, Normalizer

columns_for_encoding = ['a', 'b', 'c', 'd', 'e', 'f',
                        'g', 'h', 'i', 'j', 'k', 'l',
                        'm', 'n', 'o', 'p']

columns_for_scaling = ['aa', 'bb', 'cc', 'dd', 'ee']

transformerVectoriser = ColumnTransformer(
    transformers=[('Vector Cat', OneHotEncoder(handle_unknown="ignore"), columns_for_encoding),
                  ('Normalizer', Normalizer(), columns_for_scaling)],
    remainder='passthrough')

I know that I can do this:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from matplotlib import pyplot

x_train, x_test, y_train, y_test = train_test_split(features, results, test_size=0.2, random_state=42)
x_train = transformerVectoriser.fit_transform(x_train)
x_test = transformerVectoriser.transform(x_test)

clf = LogisticRegression(max_iter=5000, class_weight={1: 3.5, 0: 1})
model = clf.fit(x_train, y_train)

importance = model.coef_[0]

# summarize feature importance
for i, v in enumerate(importance):
    print('Feature: %0d, Score: %.5f' % (i, v))
# plot feature importance
pyplot.bar([x for x in range(len(importance))], importance)
pyplot.show()

But this only gives me feature1, feature2, feature3... and so on, and after the transformation I have around 45k features.

How can I get the list of the most important features (before the transformation)? I want to know which features work best for the model. I have a lot of categorical features with 100+ different categories, so after encoding I have more features than rows in my dataset. That is why I want to find out which features I can exclude from my dataset and which ones work best for my model.

Important: I also have other features that are used but not transformed... that is why I set remainder='passthrough'.

As you have already realized, the whole idea of feature importance is a bit tricky in the case of LogisticRegression. You can read more about it in these posts:

  1. Feature Importance in Logistic Regression for Machine Learning Interpretability
  2. How to Calculate Feature Importance With Python
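For context, both posts revolve around the same basic heuristic: with standardized inputs, the magnitude of a logistic regression coefficient can serve as a rough proxy for the importance of the corresponding feature. A minimal, self-contained sketch of that idea on toy data (the data and the names f1, f2, f3 are purely illustrative, not from your setup):

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# toy numeric data, purely for illustration
rng = np.random.default_rng(0)
X_num = pd.DataFrame(rng.normal(size=(200, 3)), columns=['f1', 'f2', 'f3'])
y = (X_num['f1'] + 0.1 * X_num['f2'] > 0).astype(int)

# standardize so that coefficient magnitudes are comparable across features
X_scaled = StandardScaler().fit_transform(X_num)
lr = LogisticRegression(max_iter=5000).fit(X_scaled, y)

# with standardized inputs, |coefficient| is a rough importance proxy
print(pd.Series(np.abs(lr.coef_[0]), index=X_num.columns).sort_values(ascending=False))

Note that this only ranks the transformed (post-encoding) features, which is exactly the limitation the rest of this answer works around.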

Personally, I didn't find these and other similar posts conclusive, so I'm going to avoid that part in my answer and instead address your concern regarding the splitting of features and the aggregation of the feature importances (assuming they are available for the split features) using a RandomForestClassifier. I will also assume that the importance of a parent feature is the sum of those of its child features.
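To make that assumption concrete: if species is one-hot encoded into species_Adelie, species_Chinstrap, and species_Gentoo, the importance of species is taken to be the sum of the three child importances. A tiny sketch with made-up numbers:

import pandas as pd

# made-up importances for the one-hot encoded child features
child_imp = pd.Series({'species_Adelie': 0.10,
                       'species_Chinstrap': 0.05,
                       'species_Gentoo': 0.20})

# parent importance = sum of its children (split the name on '_')
parent = child_imp.index.str.split('_').str[0]
print(child_imp.groupby(parent).sum())  # species    0.35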

Under these assumptions, we can use the code below to get the importances of the original features. I'm using the Palmer Archipelago (Antarctica) penguin data for illustration.

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, Normalizer

df = pd.read_csv('./data/penguins_size.csv')
df = df.dropna()
# to comply with the assumption later that column names don't contain _
df.columns = [c.replace('_', '-') for c in df.columns]

X = df.iloc[:, :-1]
y = np.asarray(df.iloc[:, 6] == 'MALE').astype(int)

pd.options.display.width = 0
print(X.head())

#   species     island  culmen-length-mm  culmen-depth-mm  flipper-length-mm  body-mass-g
# 0  Adelie  Torgersen              39.1             18.7              181.0       3750.0
# 1  Adelie  Torgersen              39.5             17.4              186.0       3800.0
# 2  Adelie  Torgersen              40.3             18.0              195.0       3250.0
# 4  Adelie  Torgersen              36.7             19.3              193.0       3450.0
# 5  Adelie  Torgersen              39.3             20.6              190.0       3650.0
columns_for_encoding = ['species', 'island']
columns_for_scaling = ['culmen-length-mm', 'culmen-depth-mm']

transformerVectoriser = ColumnTransformer(
    transformers=[('Vector Cat', OneHotEncoder(handle_unknown="ignore"), columns_for_encoding),
                  ('Normalizer', Normalizer(), columns_for_scaling)],
    remainder='passthrough')

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
x_train = transformerVectoriser.fit_transform(x_train)
x_test = transformerVectoriser.transform(x_test)

clf = RandomForestClassifier(max_depth=5)
model = clf.fit(x_train, y_train)

importance = model.feature_importances_

# feature names derived from the encoded columns and their individual importances
# encoded cols
enc_col_out = transformerVectoriser.named_transformers_['Vector Cat'].get_feature_names_out()
enc_col_out_imp = importance[transformerVectoriser.output_indices_['Vector Cat']]
# normalized cols
norm_col = transformerVectoriser.named_transformers_['Normalizer'].feature_names_in_
norm_col_imp = importance[transformerVectoriser.output_indices_['Normalizer']]
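# note: feature_names_in_ and get_feature_names_out() are available in scikit-learn >= 1.0,
# and the original column names are only preserved when the transformer was fitted on a DataFrame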
# remainder cols: these require a quick lookup, as no transformer object exists for this case
rem_cols = []
for (tname, _, cs) in transformerVectoriser.transformers_:
    if tname == 'remainder':
        rem_cols = X.columns[cs]
        break
rem_col_imp = importance[transformerVectoriser.output_indices_['remainder']]
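# note (illustrative, not printed by this code): output_indices_ maps each transformer
# name to a slice of the output columns; with 3 species, 3 islands, 2 normalized columns
# and 2 passthrough columns it would look like
# {'Vector Cat': slice(0, 6), 'Normalizer': slice(6, 8), 'remainder': slice(8, 10)}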

# storing them in a df for easy manipulation
imp_df = pd.DataFrame({'feature': list(enc_col_out) + list(norm_col) + list(rem_cols),
                       'importance': list(enc_col_out_imp) + list(norm_col_imp) + list(rem_col_imp)})

# aggregating, assuming that column names don't contain _ just to keep it simple
imp_df['feature'] = imp_df['feature'].apply(lambda x: x.split('_')[0])
imp_agg = imp_df.groupby(by=['feature']).sum()
print(imp_agg)
print(f'Sum of feature importances: {imp_df["importance"].sum()}')

Output: