sklearn stratified k-fold CV with linear model like ElasticNetCV

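Using cross-validation (CV) with sklearn is quite easy and straightforward. However, when cv=5 is set on a linear CV model such as ElasticNetCV or LassoCV, the default implementation is KFold. For various reasons I would like to use StratifiedKFold instead. From the documentation it seems that any CV method can be given via cv=.

Passing cv=KFold(5) works as expected, but cv=StratifiedKFold(5) raises an error: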

ValueError: Supported target types are: ('binary', 'multiclass'). Got 'continuous' instead.

I know that I can use cross_val_score after fitting, but I would like to pass StratifiedKFold directly to the linear model as its CV.

My minimal working example is:

from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import KFold, StratifiedKFold
import numpy as np

x = np.arange(100, dtype=np.float64).reshape(-1, 1)
y = np.arange(100) + np.random.rand(100)

# KFold default implementation:
model_default = ElasticNetCV(cv=5)
model_default.fit(x, y)  # works fine
# KFold given as cv explicitly:
model_kfexp = ElasticNetCV(cv=KFold(5))
model_kfexp.fit(x, y)  # also works fine

# StratifiedKFold given as cv explicitly:
model_skf = ElasticNetCV(cv=StratifiedKFold(5))
model_skf.fit(x, y)  # THIS RAISES THE ERROR

Any idea how to set StratifiedKFold as the CV directly?

The root of the problem is this line:

y = np.arange(100) + np.random.rand(100)

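StratifiedKFold stratifies on class labels, so it cannot split on a continuous target; hence your error. Change this line to a discrete target and your code will run without complaint: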

from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import KFold, StratifiedKFold
import numpy as np

x = np.arange(100, dtype=np.float64).reshape(-1, 1)
y = np.random.choice([0,1], size=100)

# KFold default implementation:
model_default = ElasticNetCV(cv=5)
model_default.fit(x, y)  # works fine
# KFold given as cv explicitly:
model_kfexp = ElasticNetCV(cv=KFold(5))
model_kfexp.fit(x, y)  # also works fine

# StratifiedKFold given as cv explicitly:
model_skf = ElasticNetCV(cv=StratifiedKFold(5))
model_skf.fit(x, y)  # no ERROR
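
To see why the two targets behave differently, it can help to ask sklearn how it classifies each one. The sketch below uses type_of_target from sklearn.utils.multiclass; as far as I can tell, the error message quoted in the question reports the same kind of label ('continuous' versus 'binary'), but treat the link to StratifiedKFold's internal check as an assumption. The variable names and the fixed seed are my own.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

rng = np.random.default_rng(0)  # fixed seed, only for reproducibility

# Continuous target, as in the question
y_continuous = np.arange(100) + rng.random(100)
# Discrete 0/1 target, as in the modified example
y_binary = rng.integers(0, 2, size=100)

kind_continuous = type_of_target(y_continuous)
kind_binary = type_of_target(y_binary)
print(kind_continuous, kind_binary)
```

Only targets reported as 'binary' or 'multiclass' are listed as supported in the error message, which matches what the question observed.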

Note

If your target is continuous, use KFold. If your target is categorical, you can use either KFold or StratifiedKFold, whichever suits your needs.
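
As a minimal sketch of the KFold route for a continuous target: the example data here is sorted, so plain KFold(5) would cut it into five contiguous blocks. Turning on shuffle is one option you may want to consider in that case; this is my own suggestion and not something the original code does, and the seed values are arbitrary.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
x = np.arange(100, dtype=np.float64).reshape(-1, 1)
y = np.arange(100) + rng.random(100)

# shuffle=True mixes the sorted samples before splitting (assumption: this is
# desirable for your data); random_state keeps the folds reproducible
cv = KFold(n_splits=5, shuffle=True, random_state=0)

model_kf = ElasticNetCV(cv=cv)
model_kf.fit(x, y)
print(model_kf.alpha_)
```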

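Note 2

If you insist on emulating stratified sampling for continuous data, you can bin the target with pandas.cut, stratify on the resulting bins, and finally pass the resulting generator of (train_id, test_id) pairs to the cv parameter: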

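import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import StratifiedKFold

x = np.arange(100, dtype=np.float64).reshape(-1, 1)
y = np.arange(100) + np.random.rand(100)

# Bin the continuous target into 10 equal-width classes to stratify on
y_cat = pd.cut(y, 10, labels=range(10))
skf_gen = StratifiedKFold(5).split(x, y_cat)

# Stratify on the bins, but fit on the original continuous target
model_skf = ElasticNetCV(cv=skf_gen)
model_skf.fit(x, y)  # no ERROR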
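
If you want to check that the binned stratification really spreads the target range across folds, something like the following sketch may help. It swaps pandas.cut for pandas.qcut (quantile bins, my own substitution) so that every bin holds the same number of samples, and it materialises the splits as a list so they can be inspected before being handed to cv. The names y_bins, splits and fold_counts are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
x = np.arange(100, dtype=np.float64).reshape(-1, 1)
y = np.arange(100) + rng.random(100)

# Quantile bins; labels=False returns plain integer bin codes 0..9
y_bins = pd.qcut(y, 10, labels=False)

# Keep the splits in a list so they can be inspected and then reused
splits = list(StratifiedKFold(5).split(x, y_bins))

# Count how many samples of each bin land in every test fold
fold_counts = [np.bincount(y_bins[test_idx], minlength=10) for _, test_idx in splits]
for counts in fold_counts:
    print(counts)

# The same precomputed splits can be passed as cv
model_binned = ElasticNetCV(cv=splits)
model_binned.fit(x, y)
```

With 10 equal-sized bins and 5 folds, each test fold should contain the same number of samples from every bin, so each fold covers the whole range of y.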