Stratifying folds with StratifiedKFold in sklearn

Question

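I don't fully understand the logic behind the sklearn functions train_test_split and StratifiedKFold when it comes to obtaining splits that are balanced on several "columns" rather than on the target distribution alone. I know that sentence is a bit obscure, so I hope the code below helps.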

import numpy as np
import pandas as pd
import random

n_samples = 100
prob = 0.2
pos = int(n_samples * prob)
neg = n_samples - pos

target = [1] * pos + [0] * neg
cat = ["a"] * 50 + ["b"] * 50
random.shuffle(target)
random.shuffle(cat)

ds = pd.DataFrame()
ds["target"] = target
ds["cat"] = cat
ds["f1"] = np.random.random(size=(n_samples,))
ds["f2"] = np.random.random(size=(n_samples,))
print(ds.head())

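This is a dataset of 100 examples whose target distribution is governed by prob; in this case 20% of the examples are positive. There is one binary categorical column, cat, which is perfectly balanced. The output of the code above is: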

     target cat        f1        f2
0       0   a  0.970585  0.134268
1       0   a  0.410689  0.225524
2       0   a  0.638111  0.273830
3       0   b  0.594726  0.579668
4       0   a  0.737440  0.667996

With train_test_split(), stratifying on both target and cat, the frequencies come out as follows:

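from sklearn.model_selection import train_test_split, StratifiedKFold

# frequencies in the full dataset (produces the "* dataset" part of the output)
print("* dataset")
print(ds.target.value_counts() / len(ds))
print(ds[["target", "cat"]].value_counts() / len(ds))

# with train_test_split
training, valid = train_test_split(range(n_samples),
                test_size=20,
                stratify=ds[["target", "cat"]])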

print("---")
print("* training")
print(ds.loc[training, ["target", "cat"]].value_counts() / len(training))  # balanced
print("* validation")
print(ds.loc[valid, ["target", "cat"]].value_counts() / len(valid))  # balanced

This gives:

* dataset
0    0.8
1    0.2
Name: target, dtype: float64
target  cat
0       a      0.4
        b      0.4
1       a      0.1
        b      0.1
dtype: float64
---
* training
target  cat
0       a      0.4
        b      0.4
1       a      0.1
        b      0.1
dtype: float64
* validation
target  cat
0       a      0.4
        b      0.4
1       a      0.1
        b      0.1
dtype: float64

Nicely stratified.

Now with StratifiedKFold:

# with stratified k-fold
skf = StratifiedKFold(n_splits=5)
try:
    for train, valid in skf.split(X=range(len(ds)), y=ds[["target", "cat"]]):
        pass
except Exception:  # avoid a bare except; the two-column y is rejected here
    print("! does not work")


for train, valid in skf.split(X=range(len(ds)), y=ds.target):
    print("happily iterating")

Output:

! does not work
happily iterating
happily iterating
happily iterating
happily iterating
happily iterating

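How can I get with StratifiedKFold what I got with train_test_split? I know there may be data distributions that do not allow this kind of stratification in k-fold cross-validation, but I don't understand why train_test_split accepts two or more columns while the other method does not.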

Answer 1

This does not seem to be possible at the moment.

Multilabel stratification is not exactly what you are looking for, but it is related. It was raised as an Issue on sklearn's GitHub (not sure why it was closed).

As a hack, you should be able to combine your two columns into a single new column of ordered pairs and stratify on that?
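
A minimal sketch of that hack, not a definitive implementation. It assumes a toy frame shaped like the one in the question, but built deterministically so the per-group counts are known. The strata name and the "_" separator are illustrative choices, not anything sklearn requires.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy data mirroring the question: 20% positives, balanced "cat".
# Counts per (target, cat) pair: (0, a)=40, (0, b)=40, (1, a)=10, (1, b)=10.
ds = pd.DataFrame({
    "target": [0] * 80 + [1] * 20,
    "cat": ["a"] * 40 + ["b"] * 40 + ["a"] * 10 + ["b"] * 10,
})

# Combine the two columns into one label per row, e.g. "0_a" or "1_b".
# StratifiedKFold then sees an ordinary 1-D multiclass target.
strata = ds["target"].astype(str) + "_" + ds["cat"]

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_counts = []
for train_idx, valid_idx in skf.split(X=np.zeros(len(ds)), y=strata):
    # Count each (target, cat) pair in the validation fold.
    counts = ds.iloc[valid_idx][["target", "cat"]].value_counts()
    fold_counts.append(counts)
    print((counts / len(valid_idx)).sort_index())
```

The combined label is only used to build the folds; the model would still be trained on the original target column. If some (target, cat) combination has fewer members than n_splits, the folds cannot all be balanced for that combination.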
