scikit-learn

Question

I am using scikit-learn's LinearRegression() with time series data, for example:

time_in_s              value
1539015300000000000    2.061695
1539016200000000000    40.178125
1539017100000000000    12.276094
...

Because this is the univariate case, I expect my model to be a straight line, y = m*x + c. When I do

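from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# scikit-learn expects a 2-D feature matrix, hence df[["time_in_s"]]
X_train, X_test, y_train, y_test = train_test_split(df[["time_in_s"]],
                                                    df.value,
                                                    test_size=0.3,
                                                    random_state=0,
                                                    shuffle=False)

regressor = LinearRegression().fit(X_train, y_train)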

y_pred_train = regressor.predict(X_train)
y_pred_test = regressor.predict(X_test)
[...]

I get the straight line I expect. If I use shuffle=True, though, I get a curve.

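I am having trouble understanding what shuffle does here, and why the model could learn something other than a straight line from a single feature. Any help would be appreciated.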

Edit: here are the model's attributes

>>> #shuffle=False
>>> print(f"{regressor.coef_}")
[-1.6e-16]
>>> print(f"{regressor.intercept_}")
272.0575589244862

>>> #shuffle=True
>>> print(f"{regressor.coef_}")
[-7.76e-17]
>>> print(f"{regressor.intercept_}")
143.9711420915541

And the plotting code:

import matplotlib.pyplot as plt

start = 61000
stop = 61500

fig, ax1 = plt.subplots(figsize=(15, 5))

color='tab:red'
plt.plot(df.index[start:train_length].values.reshape(-1, 1),
         df.value[start:train_length].values.reshape(-1, 1),
         color=color)
color='tab:blue'
plt.plot(df.index[train_length:stop].values.reshape(-1, 1),
         df.value[train_length:stop].values.reshape(-1, 1),
         color=color)
color='tab:green'
plt.plot(df.index[start:train_length].values.reshape(-1, 1),
         y_pred_train[start:],
         color=color,
         linestyle='dashed')
plt.plot(df.index[train_length:stop].values.reshape(-1, 1),
         y_pred_test[:stop - train_length],
         color=color,
         linestyle='dashed')

ax1.tick_params(axis='y')
ax1.tick_params(axis='x')

Answer 1

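Can you print regressor.coef_ and regressor.intercept_ for both cases? Also, how are you plotting the data? If your input is one-dimensional, linear regression can only give you one weight and one bias. The shuffle parameter only shuffles the data you pass in; it cannot make your model higher-dimensional.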
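As a rough check of that claim, here is a minimal sketch on synthetic data (the arrays `x` and `y` are made up for illustration and are not the asker's data). It fits the model with and without shuffling and confirms that, in both cases, there is exactly one coefficient and every prediction equals coef * x + intercept.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the question's data (assumption): one feature, noisy target.
rng = np.random.default_rng(0)
x = np.arange(200, dtype=float).reshape(-1, 1)  # shape (n_samples, 1)
y = 0.5 * x.ravel() + rng.normal(scale=5.0, size=200)

fits = {}
for shuffle in (False, True):
    X_train, X_test, y_train, y_test = train_test_split(
        x, y, test_size=0.3, random_state=0, shuffle=shuffle)
    model = LinearRegression().fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Every prediction equals coef * x + intercept, i.e. lies on a single line.
    on_line = np.allclose(
        y_pred, model.coef_[0] * X_test.ravel() + model.intercept_)
    fits[shuffle] = (model.coef_.shape, on_line)
    print(shuffle, model.coef_, model.intercept_, on_line)
```

The fitted numbers differ between the two runs because the training rows differ, but the shape of the model does not change.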

Answer 2

You are not getting a curve. If you check train_test_split on the help page, it says:

shuffle bool, default=True Whether or not to shuffle the data before splitting. If shuffle=False then stratify must be None.

I assume your data is sorted by df.time_in_s, so if you don't shuffle, you are fitting the regression model on the first 70% of your data and predicting on the last 30%.
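A small sketch of that behaviour, using a made-up sorted array `t` in place of df.time_in_s (an assumption for illustration): with shuffle=False, the split simply cuts the data at one point, so every training row comes before every test row.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical sorted timestamps (assumption), standing in for df.time_in_s.
t = np.arange(10)

train_part, test_part = train_test_split(t, test_size=0.3, shuffle=False)

print(train_part)  # leading block of t, original order kept
print(test_part)   # trailing block of t, original order kept

# All training timestamps precede all test timestamps.
chronological = train_part.max() < test_part.min()
print(chronological)
```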

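With shuffle=True, the row order is permuted: you train on a random 70% of the data and predict on the other 30%, regardless of time. You didn't show your plotting code, but I guess that you plotted the original data frame in order and laid the predictions over it, which is why you get this fuzzy line.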
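To illustrate, here is a minimal sketch on synthetic, time-sorted data (the arrays `t` and `value` are assumptions, not the asker's data). It shows that after a shuffled split the test rows are no longer in time order, and that pairing each prediction with its own x value (here by sorting on x) recovers a straight line.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic, time-sorted data (assumption) standing in for the question's df.
rng = np.random.default_rng(0)
t = np.arange(100, dtype=float).reshape(-1, 1)
value = 2.0 * t.ravel() + rng.normal(scale=3.0, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    t, value, test_size=0.3, random_state=0, shuffle=True)
model = LinearRegression().fit(X_train, y_train)
y_pred_test = model.predict(X_test)

# X_test is no longer in time order, so y_pred_test is not either.
in_time_order = bool(np.all(np.diff(X_test.ravel()) > 0))

# Pair each prediction with its own x (here: sort by x) before plotting.
order = np.argsort(X_test.ravel())
x_sorted = X_test.ravel()[order]
pred_sorted = y_pred_test[order]

# Consecutive slopes all equal the single coefficient: a straight line.
slopes = np.diff(pred_sorted) / np.diff(x_sorted)
print(in_time_order, np.allclose(slopes, model.coef_[0]))
```

In the plot, that would mean drawing the predictions against X_test (or X_train) itself rather than against a positional slice of df.index.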

scikit-learn - can a LinearRegression() learn something different to a straight line using one feature?

python

linear-regression