How to use CalibratedClassifierCV on an already trained xgboost model?
I want to calibrate an xgboost model that I have already trained. According to the documentation:
If “prefit” is passed, it is assumed that base_estimator has been
fitted already and all data is used for calibration.
So I tried to use it as follows:
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.calibration import CalibratedClassifierCV
X, y = make_classification()
X = pd.DataFrame(X)
X.columns = ['var' + str(i) for i in range(1, 21)]
y = pd.Series(y)
X_train, X_test, y_train, y_test = train_test_split(X, y)
model = XGBClassifier()
model.fit(X_train, y_train)
calibrated = CalibratedClassifierCV(model, method='isotonic', cv='prefit')
calibrated.fit(X_test, y_test)
Unfortunately, this results in the following error:
ValueError: feature_names mismatch: ['var1', 'var2', 'var3', 'var4',
'var5', 'var6', 'var7', 'var8', 'var9', 'var10', 'var11', 'var12',
'var13', 'var14', 'var15', 'var16', 'var17', 'var18', 'var19',
'var20'] ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9',
'f10', 'f11', 'f12', 'f13', 'f14', 'f15', 'f16', 'f17', 'f18', 'f19']
expected var12, var10, var3, var1, var20, var15, var2, var9, var16,
var7, var17, var11, var8, var5, var13, var4, var14, var6, var19, var18
in input data training data did not have the following fields: f2, f5,
f16, f17, f13, f11, f18, f6, f9, f1, f12, f10, f19, f15, f14, f3, f7,
f0, f4, f8
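I thought this might be because the features are stored in the xgboost object under the default names f1, f2, etc. So I tried renaming the X_test columns with X_test.rename(lambda x: x.replace('var', 'f'), axis = 1), but that did not solve the problem. So my question is: how can I fix this error and use CalibratedClassifierCV on an already trained xgboost model?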
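Pandas is causing the problem: you are passing column names to the sklearn model, and that is what goes wrong. The model was trained on a DataFrame with the columns var1 to var20, but by the time the calibration data reaches it the names are gone, which is why the error reports f0 to f19 instead.
Use X_train, X_test, y_train, y_test = train_test_split(X.values, y.values) and everything will work fine.
Passing numpy arrays to sklearn functions is the safest choice for full compatibility.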
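To see where the names might get lost, here is a minimal sketch. It assumes CalibratedClassifierCV validates its input much like sklearn.utils.check_array does, which held for older scikit-learn releases as far as I know but may not match your version. The two-column frame is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.utils import check_array

# Hypothetical toy frame with named columns, like X in the question.
X = pd.DataFrame(np.zeros((3, 2)), columns=["var1", "var2"])

# Input validation turns the DataFrame into a plain numpy array.
X_checked = check_array(X)

print(type(X_checked).__name__)       # ndarray
print(hasattr(X_checked, "columns"))  # False
```

An array carries no column names, so a model that remembers var1, var2, ... from training has nothing to match them against and falls back to comparing with default names.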
Full code:
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.calibration import CalibratedClassifierCV
X, y = make_classification()
X = pd.DataFrame(X)
X.columns = ['var' + str(i) for i in range(1, 21)]
y = pd.Series(y)
# Split the underlying numpy arrays so that no column names reach the model
X_train, X_test, y_train, y_test = train_test_split(X.values, y.values)
model = XGBClassifier()
model.fit(X_train, y_train)
# cv='prefit' keeps the trained model as-is and fits only the calibrator
calibrated = CalibratedClassifierCV(model, method='isotonic', cv='prefit')
calibrated.fit(X_test, y_test)
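A quick way to check the result is to keep some rows out of both training and calibration and compare the probabilities before and after. The sketch below is only an illustration under a few assumptions: it uses sklearn's GradientBoostingClassifier as a stand-in for XGBClassifier so it runs without xgboost installed, the split sizes and random_state values are arbitrary, and the try/except is there because, to my knowledge, scikit-learn 1.6 and later deprecate cv='prefit' in favour of wrapping the model in FrozenEstimator.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for XGBClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)  # plain numpy arrays

# Three parts: train the model, fit the calibrator, evaluate the result.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.5, random_state=0)
X_calib, X_eval, y_calib, y_eval = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

try:
    # Newer scikit-learn: freeze the fitted model instead of cv='prefit'.
    from sklearn.frozen import FrozenEstimator
    calibrated = CalibratedClassifierCV(FrozenEstimator(model), method="isotonic")
except ImportError:
    # Older scikit-learn: the form used in the answer above.
    calibrated = CalibratedClassifierCV(model, method="isotonic", cv="prefit")

calibrated.fit(X_calib, y_calib)  # the already trained model is not refitted

proba = calibrated.predict_proba(X_eval)
brier_raw = brier_score_loss(y_eval, model.predict_proba(X_eval)[:, 1])
brier_cal = brier_score_loss(y_eval, proba[:, 1])
print("Brier score, uncalibrated:", brier_raw)
print("Brier score, calibrated:  ", brier_cal)
```

A lower Brier score is better, but on a small synthetic set like this the calibrated model is not guaranteed to win, so treat the two numbers as a sanity check rather than a benchmark.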