OLS Rolling regression in Python Error - IndexError: index out of bounds

Question

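For my evaluation, I want to run a rolling OLS regression estimation (e.g., with a window of 3) on the dataset found at this link (https://drive.google.com/drive/folders/0B2Iv8dfU4fTUMVFyYTEtWXlzYkk), which has the format shown below. The third column of my dataset (Y) is the true value; that is what I want to predict (estimate).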

 time     X   Y
0.000543  0  10
0.000575  0  10
0.041324  1  10
0.041331  2  10
0.041336  3  10
0.04134   4  10
  ...
9.987735  55 239
9.987739  56 239
9.987744  57 239
9.987749  58 239
9.987938  59 239

I have already tried a simple OLS regression estimation with the script below.

#!/usr/bin/python -tt

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('estimated_pred.csv')

model = pd.stats.ols.MovingOLS(y=df.Y, x=df[['X']], 
                               window_type='rolling', window=3, intercept=True)
df['Y_hat'] = model.y_predict

print(df['Y_hat'])
print (model.summary)
df.plot.scatter(x='X', y='Y', s=0.1)

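However, statsmodels or scikit-learn seems like a good choice for going beyond simple regression. I tried to get the following script working with statsmodels, but it returns IndexError: index out of bounds for larger subsets of the attached dataset (e.g., more than 1000 rows).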

#!/usr/bin/python -tt
import pandas as pd
import numpy as np
import statsmodels.api as sm


df=pd.read_csv('estimated_pred.csv')    
df=df.dropna() # to drop nans in case there are any
window = 3
#print(df.index) # to print index
df['a']=None #constant
df['b1']=None #beta1
df['b2']=None #beta2
for i in range(window,len(df)):
    temp=df.iloc[i-window:i,:]
    RollOLS=sm.OLS(temp.loc[:,'Y'],sm.add_constant(temp.loc[:,['time','X']])).fit()
    df.iloc[i,df.columns.get_loc('a')]=RollOLS.params[0]
    df.iloc[i,df.columns.get_loc('b1')]=RollOLS.params[1]
    df.iloc[i,df.columns.get_loc('b2')]=RollOLS.params[2]

#The following line gives us predicted values in a row, given the PRIOR row's estimated parameters
df['predicted']=df['a'].shift(1)+df['b1'].shift(1)*df['time']+df['b2'].shift(1)*df['X']

print(df['predicted'])
#print(df['b2'])

#print(RollOLS.predict(sm.add_constant(predict_x)))

print(temp)

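Finally, I want to predict Y (i.e., predict the current value of Y from the previous 3 rolling values of X). How can we do this with statsmodels or scikit-learn, given that pd.stats.ols.MovingOLS was removed in pandas version 0.20.0 and I can't find any reference for it?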

Answer 1

I think I found your problem: according to the documentation for sm.add_constant, there is a parameter called has_constant that you need to set to 'add' (the default is 'skip').

has_constant : str {'raise', 'add', 'skip'} Behavior if ``data'' already has a constant. The default will return data without adding another constant. If 'raise', will raise an error if a constant is present. Using 'add' will duplicate the constant, if one is present. Has no effect for structured or recarrays. There is no checking for a constant in this case.

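In essence, at the failing iteration of the loop your variable time is constant within the window, so the function does not add a constant column and RollOLS.params has only 2 entries. Indexing RollOLS.params[2] then raises the IndexError.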

temp
Out[12]: 
        time   X     Y      a           b1           b2
541  0.16182  13  20.0  19.49      3.15289 -1.26116e-05
542  0.16182  14  20.0     20            0  7.10543e-15
543  0.16182  15  20.0     20 -7.45058e-09            0

sm.add_constant(temp.loc[:,['time','X']])
Out[13]: 
        time   X
541  0.16182  13
542  0.16182  14
543  0.16182  15

sm.add_constant(temp.loc[:,['time','X']], has_constant = 'add')
Out[14]: 
     const     time   X
541      1  0.16182  13
542      1  0.16182  14
543      1  0.16182  15

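So if you pass has_constant='add' to sm.add_constant, the error goes away, but you will then have two linearly dependent columns among your explanatory variables, which makes the matrix non-invertible, so the regression is meaningless.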
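
One possible way around both problems is to build the design matrix by hand and skip any window where it is rank deficient. The sketch below is only an illustration using NumPy, not a drop-in replacement for MovingOLS: the function name rolling_ols_predict, the made-up sample arrays, and the choice to leave skipped windows as NaN are all assumptions.

```python
import numpy as np

def rolling_ols_predict(time, x, y, window=3):
    """Predict y[i] from an OLS fit on the previous `window` rows.

    Windows whose design matrix [1, time, x] is rank deficient
    (for example, `time` constant within the window) are skipped
    and the prediction is left as NaN.
    """
    time = np.asarray(time, dtype=float)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Explicit intercept column, so nothing is silently skipped.
    design = np.column_stack([np.ones_like(time), time, x])
    pred = np.full(len(y), np.nan)
    for i in range(window, len(y)):
        A = design[i - window:i]
        if np.linalg.matrix_rank(A) < A.shape[1]:
            continue  # linearly dependent columns: no unique solution
        beta, *_ = np.linalg.lstsq(A, y[i - window:i], rcond=None)
        pred[i] = design[i] @ beta
    return pred

# Made-up data (not the asker's file): y is an exact linear function,
# and rows 4 to 7 share the same `time` value.
time = np.array([0.0, 0.1, 0.3, 0.4, 0.5, 0.5, 0.5, 0.5, 0.8, 1.0])
x = np.array([0, 1, 2, 4, 5, 6, 7, 9, 12, 13], dtype=float)
y = 2.0 + 3.0 * time + 0.5 * x

pred = rolling_ols_predict(time, x, y, window=4)
print(pred)
```

A window of 4 is used here because a window of 3 with three coefficients is exactly determined and leaves no residual degrees of freedom. On a DataFrame, the result could be stored with something like df['predicted'] = rolling_ols_predict(df['time'], df['X'], df['Y'], window=4).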
