Multiple predictions

Question

I have a df in which I need to predict the dependent variable (numeric) for each of the next 7 days. The training data looks like this:

df.head()
Date                   X1                X2             X3    Y
2004-11-20          453.0               654            989  716   # row 1
2004-11-21          716.0               878            886  605
2004-11-22          605.0               433            775  555
2004-11-23          555.0               453            564  680
2004-11-24          680.0               645            734  713

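Specifically, for the date 2004-11-20 in row 1, I need a predicted Y value for each of the next 7 days, not just for the current day (the variable Y). Keep in mind that when predicting the 5th day after 2004-11-20, I will not have the data for the 4 days in between.

I have been thinking about creating 7 more variables ("Y+1day", "Y+2day", and so on), but then I would need a separate training df for each day, since machine learning techniques only return one variable as output. Is there an easier way?

I am using the scikit-learn library for modeling.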

Answer 1

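You can absolutely train a model to predict multiple outputs in sklearn, and pandas is very flexible. In the example below, I convert your date column to a datetime index and then use the shift utility to get more Y values.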

import io
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Read the sample data pasted from Stack Overflow
s = """Date  X1  X2   X3   Y
2004-11-20          453.0               654            989  716  
2004-11-21          716.0               878            886  605
2004-11-22          605.0               433            775  555
2004-11-23          555.0               453            564  680
2004-11-24          680.0               645            734  713"""
text = io.StringIO(s)
df = pd.read_csv(text, sep=r'\s+')  # raw string avoids an invalid-escape warning

# Datetime index
df["Date"] = pd.to_datetime(df["Date"], format="%Y-%m-%d")  # dates use dashes, not slashes
df = df.set_index("Date")

# Shifting for Y@Day+N
# A negative shift pulls future values up, so each row holds the Y seen N rows later
df['Y1'] = df['Y'].shift(-1) # One day later
df['Y2'] = df['Y'].shift(-2) # Two...
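
To cover all seven horizons from the question, the same idea can be written as a loop. This is a minimal sketch on synthetic data, not the asker's data; the `Y_plus_h` column names and the `demo` frame are made up for illustration. It assumes one row per day with no gaps, because shift moves values by row position rather than by calendar date, and it uses a negative shift so that each row lines up with the Y value observed h rows later.

```python
import numpy as np
import pandas as pd

# Synthetic daily series (illustrative only)
dates = pd.date_range("2004-11-20", periods=20, freq="D")
demo = pd.DataFrame({"Y": np.arange(20, dtype=float)}, index=dates)

HORIZONS = range(1, 8)  # 1..7 days ahead

# shift(-h) moves values up by h rows, so row t receives Y from row t+h
for h in HORIZONS:
    demo[f"Y_plus_{h}"] = demo["Y"].shift(-h)

# Rows near the end lack some future values and contain NaN; drop them
targets = demo.dropna()
print(targets.shape)
```

Each remaining row then carries the current Y plus seven future values that can serve as a multi-column target.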

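We have to impute or drop the NaNs that result from using shift. In a large dataset, this will hopefully mean imputing or dropping data only at the edges of the time range. For example, if you want to shift by 7 days, you will lose 7 days from your dataset, depending on how your data is structured and how you need to shift.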

df.dropna(inplace=True) # Drop two rows

train, test = train_test_split(df)
# Get two training rows
trainX = train.drop(["Y", "Y1", "Y2"], axis=1)
trainY = train.drop(["X1", "X2", "X3"], axis=1)

# Get the test row
X = test.drop(["Y", "Y1", "Y2"], axis=1)
Y = test.drop(["X1", "X2", "X3"], axis=1)
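
One caveat with the split above: train_test_split shuffles rows by default, so later dates can end up in the training set. For time-indexed data, one option is a chronological hold-out. The sketch below is an assumption about what may suit this problem, not part of the original answer, and the `demo`, `train_demo`, and `test_demo` names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic daily frame (illustrative only)
dates = pd.date_range("2004-11-20", periods=10, freq="D")
demo = pd.DataFrame({"X1": np.arange(10.0), "Y": np.arange(10.0) * 2}, index=dates)

# shuffle=False keeps row order: the earliest rows train, the latest rows test
train_demo, test_demo = train_test_split(demo, test_size=0.2, shuffle=False)

print(train_demo.index.max(), test_demo.index.min())
```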

Now we can instantiate an estimator from sklearn and make predictions.

from sklearn.linear_model import LinearRegression

clf = LinearRegression()
model = clf.fit(trainX, trainY)
model.predict(X) # Array of three numbers
model.score(X, Y) # Predictably abysmal score

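All of this runs fine for me on sklearn version 0.20.1. Of course I get an abysmal score here, but the model does train, the predict method returns a prediction for each of the Y columns, and the score method returns a single score.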
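
LinearRegression accepts a two-dimensional target directly, but some sklearn estimators fit only a single output. For those, `sklearn.multioutput.MultiOutputRegressor` fits one copy of the estimator per target column. The following is a rough sketch on random synthetic data with seven targets; the data, the weights `W`, and the choice of GradientBoostingRegressor are assumptions for illustration, not something the original answer used.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))   # three features, like X1..X3
W = rng.normal(size=(3, 7))          # hypothetical true weights
Y_demo = X_demo @ W + 0.1 * rng.normal(size=(200, 7))  # seven targets

X_train, X_test = X_demo[:150], X_demo[150:]
Y_train, Y_test = Y_demo[:150], Y_demo[150:]

# LinearRegression handles a 2-D target natively
linear = LinearRegression().fit(X_train, Y_train)
linear_pred = linear.predict(X_test)

# GradientBoostingRegressor is single-output, so wrap it:
# MultiOutputRegressor fits one independent copy per target column
boosted = MultiOutputRegressor(
    GradientBoostingRegressor(n_estimators=20, random_state=0)
)
boosted.fit(X_train, Y_train)
boosted_pred = boosted.predict(X_test)

print(linear_pred.shape, boosted_pred.shape)
print(linear.score(X_test, Y_test))
```

Both predict calls return one column per horizon, so a single fitted object can produce all seven forecasts for a given row.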

Tags: python, datetime, prediction, scikit-learn