Why do I get such strange results for time series analysis?
I'm trying to build a model to forecast the daily number of orders for a takeaway café. Here is the data,
Here you can see two major spikes: these are holidays -- February 14 and March 8, respectively. You can also see an obvious seasonality with period 7: people order more on weekends and less on weekdays.
A Dickey-Fuller test shows the series is non-stationary, with p-value = 0.152.
I then decided to apply a Box-Cox transformation, since the variance looked uneven. After that, the Dickey-Fuller p-value was 0.222,
but the transformed series now looks like a sinusoid,
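The Box-Cox step itself isn't shown above; a minimal sketch using `scipy.stats.boxcox` (the `counts` column name is assumed from the data description) would be:

```python
import numpy as np
from scipy import stats

# Synthetic positive counts -- Box-Cox requires strictly positive input
rng = np.random.default_rng(1)
counts = rng.poisson(40, size=200) + 1

# Returns the transformed series and the fitted lambda;
# lambda must be kept for the inverse transformation later
counts_box, lmbda = stats.boxcox(counts)
```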
Then I applied seasonal differencing like this:
data["counts_diff"] = data.counts_box - data.counts_box.shift(7)
plt.figure(figsize=(15,10))
sm.tsa.seasonal_decompose(data.counts_diff[7:]).plot()
p_value = sm.tsa.stattools.adfuller(data.counts_diff[7:])[1]
The p-value is now about 10e-5,
and the series is stationary.
Then I plotted the ACF and PACF to choose initial parameters for a grid search,
and common sense plus the rules I know told me to choose:
Q = 1
q = 4
P = 3
p = 6
d = 0
D = 1
The model-search code:
ps = range(0, p+1)
d = 0
qs = range(0, q+1)
Ps = range(0, P+1)
D = 1
Qs = range(0, Q+1)

parameters_list = []
for p in ps:
    for q in qs:
        for P in Ps:
            for Q in Qs:
                parameters_list.append((p, q, P, Q))
len(parameters_list)
%%time
from IPython.display import clear_output
from numpy.linalg import LinAlgError  # needed for the except clause below

results = []
best_aic = float("inf")
i = 1
for param in parameters_list:
    print("counting {}/{}...".format(i, len(parameters_list)))
    i += 1
    try:
        model = sm.tsa.statespace.SARIMAX(
            data.counts_diff[7:],
            order=(param[0], d, param[1]),
            seasonal_order=(param[2], D, param[3], 7),
        ).fit(disp=-1)
    except ValueError:
        print('wrong parameters:', param)
        continue
    except LinAlgError:
        print('LU decomposition error:', param)
        continue
    finally:
        clear_output()
    aic = model.aic
    if aic < best_aic:
        best_model = model
        best_aic = aic
        best_param = param
    results.append([param, model.aic])
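To see the winner explicitly after the loop, the `results` list can be sorted by AIC; a sketch with hypothetical values in the same `[params, aic]` shape as above:

```python
import pandas as pd

# Hypothetical grid-search output in the same [params, aic] shape
results = [[(1, 0, 0, 1), -2400.1],
           [(0, 1, 1, 0), -2395.7],
           [(2, 1, 0, 1), -2413.5]]

# Lowest AIC first
result_table = pd.DataFrame(results, columns=['parameters', 'aic'])
result_table = result_table.sort_values(by='aic').reset_index(drop=True)
```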
After the grid search, I got this,
but when I plot it, it shows a constant line at zero,
even though the residuals have no bias, no trend, and no autocorrelation,
The plotting code is here:
# inverse Box-Cox transformation
def invboxcox(y, lmbda):
    if lmbda == 0:
        return np.exp(y)
    else:
        return np.exp(np.log(lmbda*y + 1) / lmbda)

data["model"] = invboxcox(best_model.fittedvalues, lmbda)
plt.figure(figsize=(15,5))
plt.ylabel('Orders count')
data.counts.plot()
#pylab.show()
data.model.plot(color='r')
plt.ylabel('Model explanation')
pylab.show()
If I uncomment that line, the plot looks like this,
What am I missing? Should I be worried about the sinusoid shape of the transformed series? And why are the scales so different?
There is also this code,
data["model"] = invboxcox(best_model.fittedvalues, lmbda)
plt.figure(figsize=(15,5))
(data.counts_diff + 1).plot()
#pylab.show()
data.model.plot(color='r')
pylab.show()
which plots two very similar graphs,
So, as far as I can tell, the problem lies in the inverse transformation.
Let's begin by importing the essentials, loading the data, and transforming the series,
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm
from statsmodels.base.transform import BoxCox
# Load data
df = pd.read_csv('data.csv')
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)
# Transformation
box_cox = BoxCox()
y, lmbda = box_cox.transform_boxcox(df['counts'])
After Box-Cox transforming your series, I tested it for a unit root,
>>>print(sm.tsa.kpss(y)[1])
0.0808334102754407
and,
>>>print(sm.tsa.adfuller(y)[1])
0.18415817136548102
So while it is not entirely stationary according to the ADF test, the KPSS test disagrees. Visual inspection suggests it may be stationary 'enough'. Let's consider a model,
df['counts'] = y
model = sm.tsa.SARIMAX(df['counts'], None, (1, 0, 1), (2, 1, 1, 7))
result = model.fit(method='bfgs')
And,
>>>print(result.summary())
Optimization terminated successfully.
Current function value: -3.505128
Iterations: 33
Function evaluations: 41
Gradient evaluations: 41
                                 Statespace Model Results
=========================================================================================
Dep. Variable:                            counts   No. Observations:                 346
Model:             SARIMAX(1, 0, 1)x(2, 1, 1, 7)   Log Likelihood               1212.774
Date:                           Wed, 24 Jul 2019   AIC                         -2413.549
Time:                                   13:37:19   BIC                         -2390.593
Sample:                               07-01-2018   HQIC                        -2404.401
                                    - 06-11-2019
Covariance Type:                             opg
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          0.8699      0.052     16.691      0.000       0.768       0.972
ma.L1         -0.5811      0.076     -7.621      0.000      -0.731      -0.432
ar.S.L7        0.0544      0.056      0.963      0.335      -0.056       0.165
ar.S.L14       0.0987      0.060      1.654      0.098      -0.018       0.216
ma.S.L7       -0.9520      0.036    -26.637      0.000      -1.022      -0.882
sigma2      4.385e-05   2.44e-06     17.975      0.000    3.91e-05    4.86e-05
===================================================================================
Ljung-Box (Q):                  34.12   Jarque-Bera (JB):                68.51
Prob(Q):                         0.73   Prob(JB):                         0.00
Heteroskedasticity (H):          1.57   Skew:                             0.08
Prob(H) (two-sided):             0.02   Kurtosis:                         5.20
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
The Ljung-Box result seems to suggest the residuals do not exhibit autocorrelation -- that's good! Let's inverse-transform both the data and the result, and plot the fit,
# Since the first 7 days were back-filled
y_hat = result.fittedvalues[7:]
# Inverse transformations
y_hat_inv = pd.DataFrame(box_cox.untransform_boxcox(y_hat, lmbda),
index=y_hat.index)
y_inv = pd.DataFrame(box_cox.untransform_boxcox(y, lmbda),
index=df.index)
# Plot fitted values with data
_, ax = plt.subplots()
y_inv.plot(ax=ax)
y_hat_inv.plot(ax=ax)
plt.legend(['Data', 'Fitted values'])
plt.show()
which gives me,
Looks pretty good!