What is the difference between "train_test_split(shuffle=False)" and "TimeSeriesSplit"?

Question

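When I try two ways of splitting time series data, the prediction results on the test dataset differ. When I inspect my data, Method 1 and Method 2 appear to produce the same split, yet the test predictions are different. So what is the difference between using train_test_split(shuffle=False) and TimeSeriesSplit?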

Here is what I tried. Splitting the dataset:

X = df_5T.drop('demand', axis=1)
y = df_5T.demand

train_test_split

from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.167, shuffle=False, random_state=0)

TimeSeriesSplit

from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit()
for train_index, val_index in tscv.split(X):
  print("TRAIN:", train_index, "TEST:", val_index)
  X_train, X_val = X.iloc[train_index], X.iloc[val_index]
  y_train, y_val = y.iloc[train_index], y.iloc[val_index]

##Result##
TRAIN: [   0    1    2 ... 1485 1486 1487] TEST: [1488 1489 1490 ... 2973 2974 2975]
TRAIN: [   0    1    2 ... 2973 2974 2975] TEST: [2976 2977 2978 ... 4461 4462 4463]
TRAIN: [   0    1    2 ... 4461 4462 4463] TEST: [4464 4465 4466 ... 5949 5950 5951]
TRAIN: [   0    1    2 ... 5949 5950 5951] TEST: [5952 5953 5954 ... 7437 7438 7439]
TRAIN: [   0    1    2 ... 7437 7438 7439] TEST: [7440 7441 7442 ... 8925 8926 8927]

When I check the data after TimeSeriesSplit, X_train and X_val look as follows.

X_train.info()
####
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 7440 entries, 2020-01-01 00:00:00+09:00 to 2020-01-26 19:55:00+09:00
####

X_val.info()
####
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1488 entries, 2020-01-26 20:00:00+09:00 to 2020-01-31 23:55:00+09:00
####
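
One way to check whether the two methods really give the same split is to compare the row indices directly. The sketch below uses a synthetic 8928-row frame as a stand-in for df_5T (an assumption, since the real data is not shown; the column name "feature" is hypothetical). It passes test_size as an integer row count for the comparison, and also repeats the call with the fraction 0.167 so the resulting row counts can be compared with the 1488-row folds printed above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit, train_test_split

# Synthetic stand-in for df_5T: 31 days of 5-minute rows (8928 rows). Assumption.
idx = pd.date_range("2020-01-01", periods=8928, freq="5min")
X = pd.DataFrame({"feature": np.arange(len(idx))}, index=idx)

# Last fold of the default 5-split TimeSeriesSplit.
train_idx, val_idx = list(TimeSeriesSplit(n_splits=5).split(X))[-1]

# Ordered hold-out split with the same number of validation rows.
X_tr_int, X_val_int = train_test_split(X, test_size=len(val_idx), shuffle=False)

# Same call with the fraction used in the question.
X_tr_frac, X_val_frac = train_test_split(X, test_size=0.167, shuffle=False)

same_as_last_fold = X_tr_int.index.equals(X.index[train_idx])
print(len(train_idx), len(val_idx))     # rows in the last TimeSeriesSplit fold
print(same_as_last_fold)                # does the integer split match that fold?
print(len(X_tr_frac), len(X_val_frac))  # rows when test_size=0.167
```

If the row counts of the fractional split differ from those of the last fold, the two methods are not training on exactly the same rows, even though the date ranges look almost identical.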

Test data

X_test = data02_5T.drop('demand', axis=1)
y_test = data02_5T.demand

Model fitting and prediction code

model.fit(X_train, y_train)
y_pred = model.predict(X_val)
# Keep the test predictions in their own variable so y_test is not overwritten
y_test_pred = model.predict(X_test)
print(rmsle(y_val, y_pred))
print(rmsle(y_test, y_test_pred))
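
The rmsle helper is not defined in the question. A minimal sketch of one common definition (root mean squared logarithmic error), assuming non-negative targets and predictions, might look like the following. It is a hypothetical stand-in, and the helper actually used to produce the scores below may differ.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error (hypothetical stand-in for the undefined helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) = log(1 + x), so zero values are handled without error.
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Example call with made-up numbers.
print(rmsle([3.0, 5.0, 2.5], [2.5, 5.0, 4.0]))
```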

Method 1 result

val: 10.659636522389675
test: 136.65172778040608

Method 2 result

val: 12.655831132167329
test: 18.771364489679307

Answer 1

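For machine learning on time series, you must use TimeSeriesSplit(). Otherwise you get data leakage: you will see a good score in a lab setting but fail in the real world. Comparing Method 1 and Method 2 makes this clear: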

Method 1 result: val: 10.659636522389675 (a good score because of data leakage); test: 136.65172778040608 (a poor score because the model does not generalize)

Method 2 result: val: 12.655831132167329 (worse than Method 1 because there is no data leakage); test: 18.771364489679307 (a reasonable score, somewhat worse than validation but good in the real world)

Why does Method 1 fail, and why is it data leakage?

Here is the answer. train_test_split() is not designed for time series data. By default (shuffle=True) it simply splits the data at random.

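Say you want to train on your data and predict the future, and the training data covers five days in January. train_test_split() may use Jan 1, Jan 2, Jan 3, and Jan 5 as training data and predict Jan 4. In the real world, Jan 4 is closely related to Jan 1, 2, 3, and 5. That causes data leakage.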
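
The scenario above can be sketched with a toy array of day numbers (hypothetical data, as is the helper name trains_on_future). It contrasts the default shuffled split with shuffle=False and checks whether any training row is later in time than the earliest test row.

```python
import numpy as np
from sklearn.model_selection import train_test_split

days = np.arange(1, 11)  # hypothetical day numbers 1..10, in time order

# Default behaviour: rows are shuffled before splitting.
train_shuf, test_shuf = train_test_split(days, test_size=0.2, random_state=0)

# shuffle=False keeps the original order: the last rows become the test set.
train_ord, test_ord = train_test_split(days, test_size=0.2, shuffle=False)

def trains_on_future(train, test):
    """True if any training row is later than the earliest test row."""
    return bool(train.max() > test.min())

print(sorted(train_shuf), sorted(test_shuf))
print(trains_on_future(train_shuf, test_shuf))
print(list(train_ord), list(test_ord))
print(trains_on_future(train_ord, test_ord))
```

Note that with shuffle=False the split is a single ordered hold-out, so the leakage described here applies to the shuffled default, not to an ordered split.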

In real work you only ever predict the future, never the past. So train_test_split() shows a good validation score but fails on real data.

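TimeSeriesSplit() rolls forward. It splits the data cumulatively: each fold trains on everything up to a point and validates on the block that follows. This ensures that you train on the past and predict the future, not the other way around, so the trained model can be used to predict future data.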
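
As a small illustration of this rolling behaviour, the sketch below runs TimeSeriesSplit on 12 toy rows (made-up data) and prints each fold. The training indices grow from fold to fold, and every test block comes strictly after its training block.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 time-ordered rows, toy data

folds = [(tr.tolist(), te.tolist())
         for tr, te in TimeSeriesSplit(n_splits=5).split(X)]

for train, test in folds:
    # The training set grows each fold; the test block always follows it.
    print("TRAIN:", train, "TEST:", test)
```

A model fitted inside such a loop is refitted on every fold, so after the loop it has been trained on the last fold only, as in the question's code.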

Here is the detailed documentation for TimeSeriesSplit().

If you like my answer, please give it an upvote.

Regards,

王勇

python

pandas

scikit-learn