When should I shuffle in StratifiedKFold?

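I have read several posts about the various CV methods. What I don't understand is why shuffling the data inside the function improves accuracy so much, and when it is correct to do so.

My time series dataset has size 921 * 10080. Each row is a time series of water temperature at a particular location in an area, and the last two columns are the labels with 2 groups, i.e. high risk (high bacteria level in the water) and low risk (low bacteria level in the water). The accuracy differs greatly depending on the shuffle setting: around 75% with "shuffle=True" compared to 50% with "shuffle=False", as shown below: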

from sklearn.model_selection import StratifiedKFold

n_folds = 5
skf = StratifiedKFold(n_splits=n_folds, shuffle=True)

The sklearn documentation states the following:

A note on shuffling

If the data ordering is not arbitrary (e.g. samples with the same class label are contiguous), shuffling it first may be essential to get a meaningful cross-validation result. However, the opposite may be true if the samples are not independently and identically distributed. For example, if samples correspond to news articles, and are ordered by their time of publication, then shuffling the data will likely lead to a model that is overfit and an inflated validation score: it will be tested on samples that are artificially similar (close in time) to training samples.

Some cross validation iterators, such as KFold, have an inbuilt option to shuffle the data indices before splitting them. Note that:

• This consumes less memory than shuffling the data directly.

• By default no shuffling occurs, including for the (stratified) K fold cross-validation performed by specifying cv=some_integer to cross_val_score, grid search, etc. Keep in mind that train_test_split still returns a random split.

• The random_state parameter defaults to None, meaning that the shuffling will be different every time KFold(..., shuffle=True) is iterated. However, GridSearchCV will use the same shuffling for each set of parameters validated by a single call to its fit method.

• To get identical results for each split, set random_state to an integer.

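I'm not sure whether I am interpreting the documentation correctly, so an explanation would be much appreciated. I also have a few questions:

1) Why does accuracy improve so much after shuffling? Am I overfitting? When should I shuffle?

2) Given that all samples were collected from the same area, they are probably not independent. How does this affect shuffling? Is shuffling still valid?

3) Does shuffling separate the labels from their corresponding X data? (Answer update: No. Shuffling does not separate the labels from their corresponding X data.)

Thanks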
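On point 3, a small check can confirm that rows and labels stay paired. This is a sketch on made-up data in which each label is derived from its own row, so any misalignment would show up.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy data: the label of each row is the parity of its feature.
X = np.arange(40).reshape(-1, 1)
y = X[:, 0] % 2

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
aligned = True
for train_idx, test_idx in skf.split(X, y):
    X_test, y_test = X[test_idx], y[test_idx]
    # The same index array selects rows and labels, so the pairs stay intact.
    aligned &= bool(np.array_equal(y_test, X_test[:, 0] % 2))

print(aligned)
```

The splitter only returns index arrays. It shuffles which indices go into which fold, not the contents of X or y.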

Your question is a tricky one, and it might be a better fit here.

In my time series dataset of size 921 * 10080 where each row is a time series of water temperature of a particular location in an area and the last column being the label with 2 groups

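Aren't you dealing with a classification problem that uses time series features? You are using dependent variables (the time series of water temperature) to predict a label. That sounds risky to me, and I think the chances of predicting the label are low. Just consider one case: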

Location  Time1  Time2  Time3  Label
A         3      2      1      1
B         100    99     98     1
C         98     99     100    0

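So in this example, label 1 is a decreasing time series and label 0 is an increasing one, but I bet every classifier will have trouble learning this unless the trend component across the columns is connected somehow.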
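One way to connect the columns is to compute an explicit trend feature. The sketch below uses the three rows from the table and the mean first difference as the trend. That choice of feature is only an illustrative assumption, not something tested on the real data.

```python
import numpy as np

# The three toy series from the table above (rows A, B, C).
series = np.array([[3, 2, 1], [100, 99, 98], [98, 99, 100]], dtype=float)
labels = np.array([1, 1, 0])

# Mean first difference per row: negative for falling, positive for rising.
trend = np.diff(series, axis=1).mean(axis=1)
print(trend)  # -1 for A and B, +1 for C

# In this toy case a single threshold on the trend recovers the labels.
predicted = (trend < 0).astype(int)
print(np.array_equal(predicted, labels))
```

The raw columns of B and C are nearly identical in level, while the trend feature separates them cleanly.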

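Coming back to your question, this may help you understand shuffling:

You are right that shuffling improves accuracy when working with time series data. The reason is that after shuffling, the training set contains samples that are very similar to the ones in the test set.

For example, if you train a model on 2010 through 2019 and then predict on 2020, every test sample is separated in time from the training period, so no information leaks. Now suppose an extreme event happened in 2020 and you shuffled the data. The training set now contains samples of that extreme event from some sensors, and the model learns to predict similar labels for the other sensors during that period in the test set. That is information leakage between the training set and the test set.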
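The difference between the two splits can be checked without training a model. The sketch below assumes a made-up layout of 10 sensors with one sample per sensor per year, and placeholder features and labels. It only counts how often a test period also appears in the training data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical setup: 10 sensors, one sample per sensor per year, 2010 to 2020.
years = np.repeat(np.arange(2010, 2021), 10)
y = np.tile([0, 1], years.size // 2)  # placeholder labels
X = np.zeros((years.size, 1))         # placeholder features

# Time-based hold-out: train on 2010-2019, test on 2020.
train_idx = np.flatnonzero(years < 2020)
test_idx = np.flatnonzero(years == 2020)
shared_holdout = np.intersect1d(years[train_idx], years[test_idx])
print(shared_holdout.size)  # 0: no year appears on both sides

# Shuffled stratified folds: count folds whose test years also occur in training.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
leaky_folds = int(sum(
    np.intersect1d(years[tr], years[te]).size > 0 for tr, te in skf.split(X, y)
))
print(leaky_folds)  # every one of the 5 folds shares years with its training set
```

If the ordering or grouping of samples matters, scikit-learn's TimeSeriesSplit or GroupKFold may be worth a look as alternatives to a shuffled split.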