Regression analysis using statsmodels
Please help me get the output of this code. Why is the output of this code nan? What did I do wrong?
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
import pandas as pd
import matplotlib.pyplot as plt
import math
import datetime as dt
#importing Data
es_url = 'https://www.stoxx.com/document/Indices/Current/HistoricalData/hbrbcpe.txt'
vs_url = 'https://www.stoxx.com/document/Indices/Current/HistoricalData/h_vstoxx.txt'
#creating DataFrame
cols=['SX5P','SX5E','SXXP','SXXE','SXXF','SXXA','DK5f','DKXF']
es=pd.read_csv(es_url,index_col=0,parse_dates=True,sep=';',dayfirst=True,header=None,skiprows=4,names=cols)
vs=pd.read_csv(vs_url,index_col=0,header=2,parse_dates=True,sep=',',dayfirst=True)
data=pd.DataFrame({'EUROSTOXX' : es['SX5E'][es.index > dt.datetime(1999,1,1)]},dtype=float)
data=data.join(pd.DataFrame({'VSTOXX' : vs['V2TX'][vs.index > dt.datetime(1999,1,1)]},dtype=float))
data=data.fillna(method='ffill')
rets=(((data/data.shift(1))-1)*100).round(2)
xdat = rets['EUROSTOXX']
ydat = rets['VSTOXX']
#regression analysis
model = smf.ols('ydat ~ xdat',data=rets).fit()
print(model.summary())
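The problem is that when you compute rets, division by zero produces inf. In addition, shift introduces NaNs, so you have missing values that need to be handled in some way before running the regression.
Walk through this example with your data to see it: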
df = data.loc['2016-03-20':'2016-04-01'].copy()
df looks like:
EUROSTOXX VSTOXX
2016-03-21 3048.77 35.6846
2016-03-22 3051.23 35.6846
2016-03-23 3042.42 35.6846
2016-03-24 2986.73 35.6846
2016-03-25 0.00 35.6846
2016-03-28 0.00 35.6846
2016-03-29 3004.87 35.6846
2016-03-30 3044.10 35.6846
2016-03-31 3004.93 35.6846
2016-04-01 2953.28 35.6846
Divide by the series shifted by 1:
df = (((df/df.shift(1))-1)*100).round(2)
This prints:
EUROSTOXX VSTOXX
2016-03-21 NaN NaN
2016-03-22 0.080688 0.0
2016-03-23 -0.288736 0.0
2016-03-24 -1.830451 0.0
2016-03-25 -100.000000 0.0
2016-03-28 NaN 0.0
2016-03-29 inf 0.0
2016-03-30 1.305547 0.0
2016-03-31 -1.286751 0.0
2016-04-01 -1.718842 0.0
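Key points: shifting by 1 always creates a NaN in the first row. Dividing 0.00 by 0.00 also produces NaN (2016-03-28), and dividing a nonzero value by 0.00 produces inf (2016-03-29).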
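The three cases can be reproduced on a toy series. This is a minimal sketch: the prices below are made up for illustration and only mimic the two consecutive zero rows in the data above.

```python
import numpy as np
import pandas as pd

# Made-up prices with two consecutive zeros, mimicking the 2016-03-25/28 rows.
prices = pd.Series([100.0, 0.0, 0.0, 50.0])

# Percentage return relative to the previous row (same formula as rets, unrounded).
toy_rets = (prices / prices.shift(1) - 1) * 100

# Row 0: no previous row         -> NaN
# Row 1: 0 / 100                 -> -100.0
# Row 2: 0 / 0                   -> NaN
# Row 3: 50 / 0                  -> inf
print(toy_rets)
```

Both NaN and inf have to be dealt with before fitting; NaN alone is not the whole story.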
One possible way to handle the missing values:
...
xdat = rets['EUROSTOXX']
ydat = rets['VSTOXX']
# handle missing values
messed_up_indices = xdat[xdat.isin([-np.inf, np.inf, np.nan]) == True].index
xdat[messed_up_indices] = xdat[messed_up_indices].replace([-np.inf, np.inf], np.nan)
xdat[messed_up_indices] = xdat[messed_up_indices].fillna(xdat.mean())
ydat[messed_up_indices] = ydat[messed_up_indices].fillna(0.0)
#regression analysis
model = smf.ols('ydat ~ xdat',data=rets, missing='raise').fit()
print(model.summary())
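Note that I added the missing='raise' argument to ols to see what is going on (it raises an error if missing values are still present).
The final result prints: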
OLS Regression Results
==============================================================================
Dep. Variable: ydat R-squared: 0.259
Model: OLS Adj. R-squared: 0.259
Method: Least Squares F-statistic: 1593.
Date: Wed, 03 Jan 2018 Prob (F-statistic): 5.76e-299
Time: 12:01:14 Log-Likelihood: -13856.
No. Observations: 4554 AIC: 2.772e+04
Df Residuals: 4552 BIC: 2.773e+04
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.1608 0.075 2.139 0.033 0.013 0.308
xdat -1.4209 0.036 -39.912 0.000 -1.491 -1.351
==============================================================================
Omnibus: 4280.114 Durbin-Watson: 2.074
Prob(Omnibus): 0.000 Jarque-Bera (JB): 4021394.925
Skew: -3.446 Prob(JB): 0.00
Kurtosis: 148.415 Cond. No. 2.11
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
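As an alternative to filling the bad values with the mean or with 0.0, the affected rows could simply be dropped before fitting. The following is only a sketch, not part of the solution above: the helper name drop_non_finite is hypothetical, and the toy frame stands in for rets.

```python
import numpy as np
import pandas as pd

def drop_non_finite(df):
    """Return a copy of df without rows that contain NaN, inf or -inf."""
    # Turn infinities into NaN so that a single dropna() removes both kinds.
    cleaned = df.replace([np.inf, -np.inf], np.nan)
    return cleaned.dropna()

# Made-up returns standing in for rets (same column names as above).
toy = pd.DataFrame({
    'EUROSTOXX': [np.nan, 0.08, -100.0, np.nan, np.inf, 1.31],
    'VSTOXX':    [np.nan, 0.0,  0.0,    0.0,    0.0,    0.0],
})

clean = drop_non_finite(toy)
print(clean)
```

The cleaned frame could then be passed to the regression, for example smf.ols('VSTOXX ~ EUROSTOXX', data=clean). Note that the -100 row caused by the zero price survives this filter; whether the zero prices should instead be removed from data before computing returns is a separate judgment call.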