Why do I get such strange results for time series analysis?
I'm trying to build a model to forecast the daily number of orders for a takeaway café. Here is the data,
Here you can see two major spikes: these are holidays -- February 14 and March 8, respectively. You can also see an obvious seasonality with period 7: people order more on weekends and less on weekdays.
A Dickey-Fuller test shows the series is non-stationary, with p-value = 0.152.
I then decided to apply a Box-Cox transformation, since the variance looked uneven. After that, the Dickey-Fuller p-value was 0.222,
but the transformed series now looks like a sinusoid,
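The Box-Cox step itself isn't shown above; a minimal sketch using `scipy.stats.boxcox` (the `counts` column name is assumed from the data description) would be:

```python
import numpy as np
from scipy import stats

# Synthetic positive counts -- Box-Cox requires strictly positive input
rng = np.random.default_rng(1)
counts = rng.poisson(40, size=200) + 1

# Returns the transformed series and the fitted lambda;
# lambda must be kept for the inverse transformation later
counts_box, lmbda = stats.boxcox(counts)
```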
Then I applied seasonal differencing like this:
data["counts_diff"] = data.counts_box - data.counts_box.shift(7)
plt.figure(figsize=(15,10))
sm.tsa.seasonal_decompose(data.counts_diff[7:]).plot()
p_value = sm.tsa.stattools.adfuller(data.counts_diff[7:])[1]
The p-value is now about 10e-5,
and the series is stationary.
Then I plotted the ACF and PACF to choose initial parameters for a grid search,
and common sense plus the rules I know told me to choose:
Q = 1
q = 4
P = 3
p = 6
d = 0
D = 1
The model-search code:
ps = range(0, p+1)
d = 0
qs = range(0, q+1)
Ps = range(0, P+1)
D = 1
Qs = range(0, Q+1)

parameters_list = []
for p in ps:
    for q in qs:
        for P in Ps:
            for Q in Qs:
                parameters_list.append((p, q, P, Q))
len(parameters_list)
%%time
from IPython.display import clear_output
from numpy.linalg import LinAlgError  # needed for the except clause below

results = []
best_aic = float("inf")
i = 1
for param in parameters_list:
    print("counting {}/{}...".format(i, len(parameters_list)))
    i += 1
    try:
        model = sm.tsa.statespace.SARIMAX(
            data.counts_diff[7:],
            order=(param[0], d, param[1]),
            seasonal_order=(param[2], D, param[3], 7),
        ).fit(disp=-1)
    except ValueError:
        print('wrong parameters:', param)
        continue
    except LinAlgError:
        print('LU decomposition error:', param)
        continue
    finally:
        clear_output()
    aic = model.aic
    if aic < best_aic:
        best_model = model
        best_aic = aic
        best_param = param
    results.append([param, model.aic])
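To see the winner explicitly after the loop, the `results` list can be sorted by AIC; a sketch with hypothetical values in the same `[params, aic]` shape as above:

```python
import pandas as pd

# Hypothetical grid-search output in the same [params, aic] shape
results = [[(1, 0, 0, 1), -2400.1],
           [(0, 1, 1, 0), -2395.7],
           [(2, 1, 0, 1), -2413.5]]

# Lowest AIC first
result_table = pd.DataFrame(results, columns=['parameters', 'aic'])
result_table = result_table.sort_values(by='aic').reset_index(drop=True)
```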
After the grid search, I got this,
but when I plot it, it shows a constant line at zero,
even though the residuals have no bias, no trend, and no autocorrelation,
The plotting code is here:
# inverse Box-Cox transformation
def invboxcox(y, lmbda):
    if lmbda == 0:
        return np.exp(y)
    else:
        return np.exp(np.log(lmbda*y + 1) / lmbda)

data["model"] = invboxcox(best_model.fittedvalues, lmbda)
plt.figure(figsize=(15,5))
plt.ylabel('Orders count')
data.counts.plot()
#pylab.show()
data.model.plot(color='r')
plt.ylabel('Model explanation')
pylab.show()
If I uncomment that line, the plot looks like this,
What am I missing? Should I be worried about the sinusoid shape of the transformed series? And why are the scales so different?
There is also this code,
data["model"] = invboxcox(best_model.fittedvalues, lmbda)
plt.figure(figsize=(15,5))
(data.counts_diff + 1).plot()
#pylab.show()
data.model.plot(color='r')
pylab.show()
which plots two very similar graphs,
So, as far as I can tell, the problem lies in the inverse transformation.
Let's begin by importing the essentials, loading the data, and transforming the series,
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm
from statsmodels.base.transform import BoxCox
# Load data
df = pd.read_csv('data.csv')
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)
# Transformation
box_cox = BoxCox()
y, lmbda = box_cox.transform_boxcox(df['counts'])
After Box-Cox transforming your series, I tested it for a unit root,
>>>print(sm.tsa.kpss(y)[1])
0.0808334102754407
and,
>>>print(sm.tsa.adfuller(y)[1])
0.18415817136548102
So while it is not entirely stationary according to the ADF test, the KPSS test disagrees. Visual inspection suggests it may be stationary 'enough'. Let's consider a model,
df['counts'] = y
model = sm.tsa.SARIMAX(df['counts'], None, (1, 0, 1), (2, 1, 1, 7))
result = model.fit(method='bfgs')
And,
>>>print(result.summary())
Optimization terminated successfully.
Current function value: -3.505128
Iterations: 33
Function evaluations: 41
Gradient evaluations: 41
                                 Statespace Model Results
=========================================================================================
Dep. Variable:                            counts   No. Observations:                 346
Model:             SARIMAX(1, 0, 1)x(2, 1, 1, 7)   Log Likelihood               1212.774
Date:                           Wed, 24 Jul 2019   AIC                         -2413.549
Time:                                   13:37:19   BIC                         -2390.593
Sample:                               07-01-2018   HQIC                        -2404.401
                                    - 06-11-2019
Covariance Type:                             opg
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          0.8699      0.052     16.691      0.000       0.768       0.972
ma.L1         -0.5811      0.076     -7.621      0.000      -0.731      -0.432
ar.S.L7        0.0544      0.056      0.963      0.335      -0.056       0.165
ar.S.L14       0.0987      0.060      1.654      0.098      -0.018       0.216
ma.S.L7       -0.9520      0.036    -26.637      0.000      -1.022      -0.882
sigma2      4.385e-05   2.44e-06     17.975      0.000    3.91e-05    4.86e-05
===================================================================================
Ljung-Box (Q):                  34.12   Jarque-Bera (JB):                68.51
Prob(Q):                         0.73   Prob(JB):                         0.00
Heteroskedasticity (H):          1.57   Skew:                             0.08
Prob(H) (two-sided):             0.02   Kurtosis:                         5.20
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
The Ljung-Box result seems to suggest the residuals do not exhibit autocorrelation -- that's good! Let's inverse-transform both the data and the result, and plot the fit,
# Since the first 7 days were back-filled
y_hat = result.fittedvalues[7:]
# Inverse transformations
y_hat_inv = pd.DataFrame(box_cox.untransform_boxcox(y_hat, lmbda),
index=y_hat.index)
y_inv = pd.DataFrame(box_cox.untransform_boxcox(y, lmbda),
index=df.index)
# Plot fitted values with data
_, ax = plt.subplots()
y_inv.plot(ax=ax)
y_hat_inv.plot(ax=ax)
plt.legend(['Data', 'Fitted values'])
plt.show()
which gives me,
Looks pretty good!