Sklearn - Permutation Importance leads to non-zero values for zero-coefficients in model
I'm confused by sklearn's permutation_importance function. I have fitted a pipeline with a regularized logistic regression, which results in several feature coefficients being exactly 0. However, when I compute the permutation importance of the features on the test set, some of these features get non-zero importance values.
How can that be, if they don't contribute to the classifier?
Below is some example code and data:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold
import scipy.stats as stats
from sklearn.utils.fixes import loguniform
from sklearn.preprocessing import StandardScaler
from sklearn.impute import KNNImputer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.inspection import permutation_importance

# create example data with missings
X, y = make_classification(n_samples=500,
                           n_features=100,
                           n_informative=25,
                           n_redundant=75,
                           random_state=0)
c = 10000  # number of missings
X.ravel()[np.random.choice(X.size, c, replace=False)] = np.nan  # introduce random missings

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, random_state=0)

folds = 5
repeats = 5
n_iter = 25
rskfold = RepeatedStratifiedKFold(n_splits=folds, n_repeats=repeats, random_state=1897)

scl = StandardScaler()
imp = KNNImputer(n_neighbors=5, weights='uniform')
# note: loss='log' was renamed to loss='log_loss' in scikit-learn >= 1.3
sgdc = SGDClassifier(loss='log', penalty='elasticnet', class_weight='balanced', random_state=0)
pipe = Pipeline([('scaler', scl),
                 ('imputer', imp),
                 ('clf', sgdc)])
param_rand = {'clf__l1_ratio': stats.uniform(0, 1),
              'clf__alpha': loguniform(0.001, 1)}

m = RandomizedSearchCV(pipe, param_rand, n_iter=n_iter, cv=rskfold,
                       scoring='accuracy', random_state=0, verbose=1, n_jobs=-1)
m.fit(Xtrain, ytrain)

coefs = m.best_estimator_.steps[2][1].coef_
print('Number of non-zero feature coefficients in classifier:')
print(np.sum(coefs != 0))

imps = permutation_importance(m, Xtest, ytest, n_repeats=25, random_state=0, n_jobs=-1)
print('Number of non-zero feature importances after permutations:')
print(np.sum(imps['importances_mean'] != 0))
You will see that the second printed number does not match the first...
Any help is greatly appreciated!
That's because you have a KNNImputer. Features whose coefficients are zero in the model can still affect the imputation of the other columns, so permuting them can change the predictions of the pipeline as a whole, and they can therefore have non-zero permutation importance.
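A minimal sketch of that imputation effect, on toy data invented for illustration: with KNNImputer, the value imputed for one column depends on the values in the other columns, so changing a column the classifier ignores still changes the imputed inputs downstream.

```python
import numpy as np
from sklearn.impute import KNNImputer

# three samples, three features; the last value of row 0 is missing
X = np.array([[1.0, 2.0, np.nan],
              [1.0, 2.0, 4.0],
              [10.0, 2.0, 8.0]])
v_before = KNNImputer(n_neighbors=1).fit_transform(X)[0, 2]
print(v_before)  # 4.0 -- nearest neighbour of row 0 is row 1

# change only column 0 (imagine it has a zero coefficient in the classifier):
# the nearest neighbour flips to row 2, and the imputed value changes with it
X[1, 0], X[2, 0] = X[2, 0], X[1, 0]
v_after = KNNImputer(n_neighbors=1).fit_transform(X)[0, 2]
print(v_after)  # 8.0
```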
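Conversely, if you compute the permutation importance on the classifier alone, feeding it data that has already been scaled and imputed, zero-coefficient features get an importance of exactly zero, because permuting them cannot change the decision function. A sketch (the zero coefficient is forced by hand here, purely for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression().fit(X, y)
clf.coef_[0, 3] = 0.0  # force one coefficient to zero for illustration
imps = permutation_importance(clf, X, y, n_repeats=10, random_state=0)
# permuting feature 3 leaves the predictions unchanged, so its importance is exactly 0
print(imps.importances_mean[3])  # 0.0
```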