Scikit-learn - feature reduction using RFECV and GridSearch. Where are the coefficients stored?
I am using Scikit-learn's RFECV to select the most important features for a logistic regression, using cross-validation. Assume X is an [n, x] dataframe of features and y represents the response variable:
from sklearn.pipeline import make_pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import StratifiedKFold
from sklearn import preprocessing
from sklearn.feature_selection import RFECV
import numpy as np
import sklearn
import sklearn.linear_model as lm
import sklearn.grid_search as gs
# Create a logistic regression estimator
logreg = lm.LogisticRegression()
# Use RFECV to pick best features, using Stratified Kfold
rfecv = RFECV(estimator=logreg, cv=StratifiedKFold(y, 3), scoring='roc_auc')
# Fit the features to the response variable
rfecv.fit(X, y)
# Put the best features into new df X_new
X_new = rfecv.transform(X)
#
pipe = make_pipeline(preprocessing.StandardScaler(), lm.LogisticRegression())
# Define a range of hyper parameters for grid search
C_range = 10.**np.arange(-5, 1)
penalty_options = ['l1', 'l2']
skf = StratifiedKFold(y, 3)
param_grid = dict(logisticregression__C=C_range, logisticregression__penalty=penalty_options)
grid = GridSearchCV(pipe, param_grid, cv=skf, scoring='roc_auc')
grid.fit(X_new, y)
Two questions:
a) Is this the correct process for feature and hyperparameter selection, and for fitting?
b) Where can I find the fitted coefficients for the selected features?
Is this the correct process for feature selection?
This is one of many ways of feature selection. Recursive feature elimination is an automated approach; others are listed in the scikit-learn documentation. They all have different pros and cons, and usually feature selection is best achieved by common sense and by trying models with different sets of features. RFE is a quick way of selecting a good set of features, but it does not necessarily give you the ultimately best one. By the way, you don't need to build your StratifiedKFold separately. If you just set the cv parameter to cv=3, both RFECV and GridSearchCV will automatically use StratifiedKFold when the y values are binary or multiclass, which I assume is most likely the case here since you are using LogisticRegression.
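A minimal sketch of that simplification, reusing the estimator, param_grid, and scoring defined above:
# cv=3 is enough; for a classifier, RFECV builds the StratifiedKFold internally
rfecv = RFECV(estimator=logreg, cv=3, scoring='roc_auc')
# the same applies to GridSearchCV, so skf is not needed either
grid = GridSearchCV(pipe, param_grid, cv=3, scoring='roc_auc')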
You can also combine
# Fit the features to the response variable
rfecv.fit(X, y)
# Put the best features into new df X_new
X_new = rfecv.transform(X)
into
X_new = rfecv.fit_transform(X, y)
Is this the correct process for hyperparameter selection?
GridSearchCV is basically an automated, systematic way of trying a whole set of combinations of model parameters and picking the best among them according to some performance metric. Yes, it is a good way of finding suitable parameters. For the grid above, that means 6 values of C times 2 penalty options, i.e. 12 candidate models, each scored with 3-fold cross-validation.
Is this the correct fitting process?
Yes, this is a valid way of fitting the model. When you call grid.fit(X_new, y), it makes a grid of LogisticRegression estimators (each with a different set of parameters to be tried) and fits each of them. It will keep the best-performing one under grid.best_estimator_, the parameters of that estimator under grid.best_params_, and that estimator's performance score under grid.best_score_. It returns itself, not the best estimator. Remember that for incoming new X values you want the model to predict on, you have to apply the transform with the fitted RFECV model first. So you could actually also add this step to the pipeline, as sketched below.
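A rough sketch of that, assuming the same param_grid as above (one possible arrangement, not the only one):
# put RFECV inside the pipeline, so feature selection is applied
# automatically to any new X passed to grid.predict later
pipe = make_pipeline(preprocessing.StandardScaler(),
                     RFECV(estimator=lm.LogisticRegression(), cv=3, scoring='roc_auc'),
                     lm.LogisticRegression())
grid = GridSearchCV(pipe, param_grid, cv=3, scoring='roc_auc')
grid.fit(X, y)  # fit on the raw X; the rfecv step handles the selection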
Where can I find the fitted coefficients for the selected features?
The grid.best_estimator_ attribute holds the refitted winner. Since pipe is a Pipeline, the LogisticRegression object with all this information is its last step, so grid.best_estimator_.named_steps['logisticregression'].coef_ has all the coefficients (and the same step's intercept_ is the intercept). Note that to be able to get this grid.best_estimator_, the refit parameter on GridSearchCV needs to be set to True, but this is the default anyway.
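A small sketch for mapping those coefficients back to the original feature names, assuming X is a pandas DataFrame as in the question (the variable names below are illustrative):
import pandas as pd
# rfecv.support_ is a boolean mask over the original columns
selected = X.columns[rfecv.support_]
logreg_fitted = grid.best_estimator_.named_steps['logisticregression']
coefs = pd.Series(logreg_fitted.coef_[0], index=selected)
print(coefs.sort_values())
print('intercept:', logreg_fitted.intercept_[0])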
Essentially, you need a train-validation-test split of your sample data, where the training set is used for tuning the normal parameters, the validation set for tuning the hyperparameters in the grid search, and the test set for performance evaluation. Here is one way to do this.
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
import pandas as pd
import numpy as np
# simulate some artificial data so that I can show you the result of each intermediate step
# 1000 obs, X dim 1000-by-100, 2 different y labels with unbalanced weights
X, y = make_classification(n_samples=1000, n_features=100, n_informative=5, n_classes=2, weights=[0.1, 0.9])
X.shape
Out[78]: (1000, 100)
y.shape
Out[79]: (1000,)
# Nested cross-validation: this returns a train/test index iterator
split = StratifiedKFold(y, n_folds=5, shuffle=True, random_state=1)
# to take a look at the split, you will see it has 5 tuples
list(split)
# the 1st fold
train_index = list(split)[0][0]
train_index
Out[80]: array([  0,   1,   2, ..., 997, 998, 999])
test_index = list(split)[0][1]
test_index
Out[81]: array([  5,  12,  17, ..., 979, 982, 984])
# let's play with just one iteration for now
# your pipe
pipe = make_pipeline(StandardScaler(), LogisticRegression())
# set up params
params_space = dict(logisticregression__C=10.0**np.arange(-5,1),
logisticregression__penalty=['l1', 'l2'],
logisticregression__class_weight=[None, 'auto'])
# apply your grid search only on the train data, but with a further cv step,
# so the original train set becomes [gscv_train, gscv_validation], where the latter is used to tune hyperparameters;
# all performance is still evaluated on a separate held-out 'test' set
grid = GridSearchCV(pipe, params_space, cv=StratifiedKFold(y[train_index], n_folds=3), scoring='roc_auc')
# fit the data on train set
grid.fit(X[train_index], y[train_index])
# to get the params of your estimator, call your gscv
grid.best_estimator_
Out[82]:
Pipeline(steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('logisticregression', LogisticRegression(C=0.10000000000000001, class_weight=None, dual=False,
fit_intercept=True, intercept_scaling=1, max_iter=100,
multi_class='ovr', penalty='l1', random_state=None,
solver='liblinear', tol=0.0001, verbose=0))])
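To read off just the winning parameter combination rather than the full pipeline repr, grid.best_params_ is handy (the dict below is what Out[82] implies, shown for illustration):
grid.best_params_
# {'logisticregression__C': 0.1, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l1'}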
# the performance on the validation set
grid.grid_scores_
Out[83]:
[mean: 0.50000, std: 0.00000, params: {'logisticregression__C': 1.0000000000000001e-05, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l1'},
mean: 0.87975, std: 0.01753, params: {'logisticregression__C': 1.0000000000000001e-05, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l2'},
mean: 0.50000, std: 0.00000, params: {'logisticregression__C': 1.0000000000000001e-05, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l1'},
mean: 0.87985, std: 0.01746, params: {'logisticregression__C': 1.0000000000000001e-05, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l2'},
mean: 0.50000, std: 0.00000, params: {'logisticregression__C': 0.0001, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l1'},
mean: 0.88033, std: 0.01707, params: {'logisticregression__C': 0.0001, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l2'},
mean: 0.50000, std: 0.00000, params: {'logisticregression__C': 0.0001, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l1'},
mean: 0.87975, std: 0.01732, params: {'logisticregression__C': 0.0001, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l2'},
mean: 0.50000, std: 0.00000, params: {'logisticregression__C': 0.001, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l1'},
mean: 0.88245, std: 0.01732, params: {'logisticregression__C': 0.001, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l2'},
mean: 0.50000, std: 0.00000, params: {'logisticregression__C': 0.001, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l1'},
mean: 0.87955, std: 0.01686, params: {'logisticregression__C': 0.001, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l2'},
mean: 0.50000, std: 0.00000, params: {'logisticregression__C': 0.01, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l1'},
mean: 0.88746, std: 0.02318, params: {'logisticregression__C': 0.01, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l2'},
mean: 0.50000, std: 0.00000, params: {'logisticregression__C': 0.01, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l1'},
mean: 0.87990, std: 0.01634, params: {'logisticregression__C': 0.01, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l2'},
mean: 0.94002, std: 0.02959, params: {'logisticregression__C': 0.10000000000000001, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l1'},
mean: 0.87419, std: 0.02174, params: {'logisticregression__C': 0.10000000000000001, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l2'},
mean: 0.93508, std: 0.03101, params: {'logisticregression__C': 0.10000000000000001, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l1'},
mean: 0.87091, std: 0.01860, params: {'logisticregression__C': 0.10000000000000001, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l2'},
mean: 0.88013, std: 0.03246, params: {'logisticregression__C': 1.0, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l1'},
mean: 0.85247, std: 0.02712, params: {'logisticregression__C': 1.0, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l2'},
mean: 0.88904, std: 0.02906, params: {'logisticregression__C': 1.0, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l1'},
mean: 0.85197, std: 0.02097, params: {'logisticregression__C': 1.0, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l2'}]
# or the best score among them
grid.best_score_
Out[84]: 0.94002188482393367
# after training the estimator, we now predict on the test set
y_pred = grid.predict(X[test_index])
# since LogisticRegression is a probability-based model, we have the luxury of getting the probability for each obs
y_pred_probs = grid.predict_proba(X[test_index])
y_pred_probs
Out[87]:
array([[ 0.0632, 0.9368],
[ 0.0236, 0.9764],
[ 0.0227, 0.9773],
...,
[ 0.0108, 0.9892],
[ 0.2903, 0.7097],
[ 0.0113, 0.9887]])
# to get the evaluation result
print(classification_report(y[test_index], y_pred))
precision recall f1-score support
0 0.93 0.59 0.72 22
1 0.95 0.99 0.97 179
avg / total 0.95 0.95 0.95 201
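Since the grid was tuned for roc_auc, it may also be worth reporting AUC on the same held-out fold; a small sketch using the probabilities from above:
from sklearn.metrics import roc_auc_score
# column 1 of predict_proba holds P(y = 1)
print(roc_auc_score(y[test_index], y_pred_probs[:, 1]))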
# to put it all together with the nested cross-validation
# generate a pandas dataframe to store the prediction probabilities
kfold_df = pd.DataFrame(0.0, index=np.arange(len(y)), columns=np.unique(y))
report = []  # to store the classification reports
split = StratifiedKFold(y, n_folds=5, shuffle=True, random_state=1)
for train_index, test_index in split:
    grid = GridSearchCV(pipe, params_space, cv=StratifiedKFold(y[train_index], n_folds=3), scoring='roc_auc')
    grid.fit(X[train_index], y[train_index])
    y_pred_probs = grid.predict_proba(X[test_index])
    kfold_df.iloc[test_index, :] = y_pred_probs
    y_pred = grid.predict(X[test_index])
    report.append(classification_report(y[test_index], y_pred))
# your result
print(kfold_df)
Out[88]:
0 1
0 0.1710 0.8290
1 0.0083 0.9917
2 0.2049 0.7951
3 0.0038 0.9962
4 0.0536 0.9464
5 0.0632 0.9368
6 0.1243 0.8757
7 0.1150 0.8850
8 0.0796 0.9204
9 0.4096 0.5904
.. ... ...
990 0.0505 0.9495
991 0.2128 0.7872
992 0.0270 0.9730
993 0.0434 0.9566
994 0.8078 0.1922
995 0.1452 0.8548
996 0.1372 0.8628
997 0.0127 0.9873
998 0.0935 0.9065
999 0.0065 0.9935
[1000 rows x 2 columns]
for r in report:
    print(r)
precision recall f1-score support
0 0.93 0.59 0.72 22
1 0.95 0.99 0.97 179
avg / total 0.95 0.95 0.95 201
precision recall f1-score support
0 0.86 0.55 0.67 22
1 0.95 0.99 0.97 179
avg / total 0.94 0.94 0.93 201
precision recall f1-score support
0 0.89 0.38 0.53 21
1 0.93 0.99 0.96 179
avg / total 0.93 0.93 0.92 200
precision recall f1-score support
0 0.88 0.33 0.48 21
1 0.93 0.99 0.96 178
avg / total 0.92 0.92 0.91 199
precision recall f1-score support
0 0.88 0.33 0.48 21
1 0.93 0.99 0.96 178
avg / total 0.92 0.92 0.91 199
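As a follow-up sketch: since every observation received an out-of-fold probability in kfold_df, one overall AUC can be computed across all five folds (kfold_df's columns are the class labels, so column 1 is P(y = 1)):
from sklearn.metrics import roc_auc_score
print(roc_auc_score(y, kfold_df[1]))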