k-fold stratified cross-validation with imbalanced classes

Question

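I have data with 4 classes and I am trying to build a classifier. I have ~1000 vectors for one class, ~10^4 for another, ~10^5 for the third and ~10^6 for the fourth. I was hoping to use cross-validation, so I looked at the scikit-learn docs.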

My first try was to use StratifiedShuffleSplit, but this gives the same percentage for each class, leaving the classes drastically imbalanced still.

Is there a way to do cross-validation but with the classes balanced in the training and test set?

As a side note, I couldn't work out the difference between StratifiedShuffleSplit and StratifiedKFold. The descriptions look very similar to me.

Answer 1

My first try was to use StratifiedShuffleSplit but this gives the same percentage for each class, leaving the classes drastically imbalanced still.

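I get the feeling that you're confusing what a stratified strategy does, but you'll need to show your code and your results to say for sure what's going on. Is it the same percentage as in the original set, or the same percentage for every class within the returned train/test set? The first one is how it's supposed to be.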

As a side note, I couldn't work out the difference between StratifiedShuffleSplit and StratifiedKFold. The descriptions look very similar to me.

One of these should definitely work. The description of the first one is admittedly a little confusing, but here's what they do.

StratifiedShuffleSplit

Provides train/test indices to split data in train test sets.

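This means that it splits your data into a train set and a test set. The stratified part means that the class percentages are maintained in this split. So if 10% of your data is in class 1 and 90% is in class 2, this will ensure that 10% of your train set is in class 1 and 90% is in class 2. The same holds for the test set.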
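To make the 10%/90% case concrete, here is a minimal sketch. The toy labels, the placeholder feature matrix, and the choices of `n_splits`, `test_size` and `random_state` are all assumptions made for illustration, not values from the question.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical toy labels: 10% class 1, 90% class 2 (100 samples in total).
y = np.array([1] * 10 + [2] * 90)
X = np.zeros((len(y), 1))  # placeholder features; only y drives the stratification

sss = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=0)
fractions = []
for train_idx, test_idx in sss.split(X, y):
    train_frac = float(np.mean(y[train_idx] == 1))
    test_frac = float(np.mean(y[test_idx] == 1))
    fractions.append((train_frac, test_frac))
    print("class 1 share: train={:.2f}, test={:.2f}".format(train_frac, test_frac))
```

Each split should report a class 1 share of about 0.10 in both the train and the test set, which is the original proportion and not a 50/50 balance.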

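Your post sounds like you want 50% of each class in the test set. That isn't what stratification does: stratification maintains the original percentages. You should maintain them, because otherwise you will get a misleading idea of your classifier's performance. Who cares how well it classifies a 50/50 split when in practice you will see 10/90 splits?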
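A small worked example shows why the test distribution matters. The classifier below is hypothetical: its true-positive and false-positive rates (0.9 and 0.1) are assumed numbers, chosen only to make the arithmetic easy to follow.

```python
def precision(tpr, fpr, positive_share):
    """Precision of a classifier with fixed TPR/FPR at a given positive-class share."""
    tp = tpr * positive_share           # fraction of all samples that are true positives
    fp = fpr * (1.0 - positive_share)   # fraction of all samples that are false positives
    return tp / (tp + fp)

# Hypothetical classifier: catches 90% of positives, wrongly flags 10% of negatives.
balanced = precision(0.9, 0.1, 0.5)    # artificial 50/50 test set
realistic = precision(0.9, 0.1, 0.1)   # 10/90 split as seen in practice
print("precision on 50/50: {:.2f}".format(balanced))
print("precision on 10/90: {:.2f}".format(realistic))
```

The same classifier looks far better on the balanced test set than on the realistic one, so a score measured on rebalanced test data says little about how it will perform in practice.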

StratifiedKFold

This cross-validation object is a variation of KFold that returns stratified folds. The folds are made by preserving the percentage of samples for each class.

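See k-fold cross validation. Without stratification, it just splits your data into k folds. Then each fold i, 1 <= i <= k, is used once as the test set while the others are used for training. The results are averaged at the end. It's similar to running ShuffleSplit k times.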

Stratification ensures that the percentage of each class in your entire data is the same (or very close) within each individual fold.
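The fold behaviour can be checked with a short sketch. As before, the toy labels and the `n_splits`, `shuffle` and `random_state` settings are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels: 10 samples of class 1 and 90 of class 2.
y = np.array([1] * 10 + [2] * 90)
X = np.zeros((len(y), 1))  # placeholder features

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
test_counts = []
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    n_class1 = int(np.sum(y[test_idx] == 1))
    test_counts.append((len(test_idx), n_class1))
    print("fold {}: {} test samples, {} from class 1".format(fold, len(test_idx), n_class1))
```

Every fold holds 20 test samples, 2 of them from class 1. Unlike StratifiedShuffleSplit, whose test sets are drawn independently and may overlap, the k test folds here are disjoint and together cover every sample exactly once.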

There is a lot of literature that deals with imbalanced classes. Some simple-to-use methods involve class weights and analyzing the ROC curve.
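As a rough sketch of class weights combined with an ROC-based metric, the example below uses a synthetic two-class data set and logistic regression. Both are assumptions chosen for brevity; the question itself has four classes and does not name a model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Hypothetical synthetic data with a roughly 95/5 class imbalance.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)

aucs = []
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    # class_weight="balanced" weights each class inversely to its frequency.
    clf = LogisticRegression(class_weight="balanced", max_iter=1000)
    clf.fit(X[train_idx], y[train_idx])
    scores = clf.predict_proba(X[test_idx])[:, 1]  # score for the minority class
    aucs.append(roc_auc_score(y[test_idx], scores))

print("mean ROC AUC over folds: {:.3f}".format(np.mean(aucs)))
```

The folds keep the original class proportions, so the imbalance is handled by the model's class weights and by a metric based on ranking rather than by rebalancing the test data.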

Answer 2

K-Fold CV

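K-Fold CV works by randomly partitioning your data into k (fairly) equal partitions. If your data were evenly balanced across classes, like [0,1,0,1,0,1,0,1,0,1], random sampling (with or without replacement) would give you approximately equal numbers of 0s and 1s.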

However, if your data is more like [0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0], where one class dominates the data, k-fold CV without weighted sampling would give you erroneous results.

If you use ordinary k-fold CV without adjusting the sampling weights away from uniform sampling, then you'd obtain something like

## k-fold CV
import numpy as np

# The 27 zeros and 5 ones from above, sorted by class
y = np.array([0] * 27 + [1] * 5)
k = 5
splits = np.array_split(y, k)
print(splits)

 [array([0, 0, 0, 0, 0, 0, 0]),
 array([0, 0, 0, 0, 0, 0, 0]),
 array([0, 0, 0, 0, 0, 0]),
 array([0, 0, 0, 0, 0, 0]),
 array([0, 1, 1, 1, 1, 1])]

There are clearly splits without a useful representation of both classes.

The point of k-fold CV is to train and test a model across all subsets of the data, leaving out 1 subset at each trial and training on the other k-1 subsets.
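The leave-one-subset-out loop described here can be sketched with NumPy alone. The ten sample indices and the seed are hypothetical, and the actual model fitting is left out.

```python
import numpy as np

rng = np.random.default_rng(0)
indices = rng.permutation(10)        # shuffled indices of 10 hypothetical samples
k = 5
folds = np.array_split(indices, k)   # k (fairly) equal partitions

seen_in_test = []
for i in range(k):
    test_idx = folds[i]
    # Train on the union of the other k-1 folds.
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    seen_in_test.extend(test_idx.tolist())
    print("trial {}: test={}, train size={}".format(
        i, sorted(test_idx.tolist()), len(train_idx)))
```

After the k trials, every sample has been used for testing exactly once.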

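In this scenario, you'd want to split by strata. In the above data set, there are 27 0s and 5 1s. If you'd like to compute k=5 CV, it wouldn't be reasonable to split the stratum of 1s into 5 subsets. A better solution is to split it into k < 5 subsets, such as 2. The stratum of 0s can keep k=5 splits since it's much larger. Then, while training, you'd have a simple product of 2 x 5 combinations from the data set. Here is some code to illustrate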

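import numpy as np
from itertools import groupby, product

# Same labels as above, sorted by class: groupby only merges consecutive equal items.
y = np.array([0] * 27 + [1] * 5)

for stratum, iterable in groupby(y):
    data = np.array(list(iterable))
    if stratum == 0:
        zeros = np.array_split(data, 5)  # large stratum: 5 splits
    else:
        ones = np.array_split(data, 2)   # small stratum: only 2 splits

# All 2 x 5 pairings of a zeros split with a ones split.
cv_splits = list(product(zeros, ones))
print(cv_splits)

for i in range(2):
    for j in range(5):
        # 1 - i and 1 - j select one other split of each stratum
        # (a negative 1 - j indexes from the end of the list).
        data = np.concatenate((ones[1 - i], zeros[1 - j]))
        print("Leave out ONES split {}, and Leave out ZEROS split {}".format(i, j))
        print("train on: ", data)
        print("test on: ", np.concatenate((ones[i], zeros[j])))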



Leave out ONES split 0, and Leave out ZEROS split 0
train on:  [1 1 0 0 0 0 0 0]
test on:  [1 1 1 0 0 0 0 0 0]
Leave out ONES split 0, and Leave out ZEROS split 1
train on:  [1 1 0 0 0 0 0 0]
...
Leave out ONES split 1, and Leave out ZEROS split 4
train on:  [1 1 1 0 0 0 0 0]
test on:  [1 1 0 0 0 0 0]

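This method splits the data into partitions such that every partition is eventually left out for testing. Note that not all statistical learning methods allow for weighting, so adjusting procedures like CV is essential to account for the sampling proportions.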
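The loop above trains on a single other split of each stratum. A variant that trains on all remaining splits, in line with the "train on k-1 subsets" description, might look like the sketch below. The helper `others` is a hypothetical name introduced here, and only the split sizes are recorded.

```python
import numpy as np

y = np.array([0] * 27 + [1] * 5)      # the 27 zeros and 5 ones from the example
zeros = np.array_split(y[y == 0], 5)  # large stratum: 5 splits
ones = np.array_split(y[y == 1], 2)   # small stratum: 2 splits

def others(splits, held_out):
    """Concatenate every split except the held-out one (hypothetical helper)."""
    return np.concatenate([s for n, s in enumerate(splits) if n != held_out])

trials = []
for i in range(len(ones)):
    for j in range(len(zeros)):
        test = np.concatenate((ones[i], zeros[j]))
        train = np.concatenate((others(ones, i), others(zeros, j)))
        trials.append((len(train), len(test)))
        print("hold out ones[{}] and zeros[{}]: train on {} samples, test on {}".format(
            i, j, len(train), len(test)))
```

Each of the 2 x 5 trials then uses all 32 samples, split between training and testing. In practice one would split sample indices rather than the label values themselves, so the same partition can be applied to the feature matrix.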

Reference: James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning: with Applications in R.

