How to impute missing values in time series data with the value from the same day and time in the previous week (or day) in Python

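I have a dataframe with a timestamp index and an energy-usage column. A reading is taken every minute of the day, i.e. 1440 readings per day. The dataframe has a few missing values.

I want to impute those missing values with the mean of the values from the same day of the week and the same time over the previous two or three weeks. That way, if the value from the previous week is also missing, I can still use the value from two weeks ago.

Here is a sample of the data: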

                    mains_1
timestamp   
2013-01-03 00:00:00 155.00
2013-01-03 00:01:00 154.00
2013-01-03 00:02:00 NaN
2013-01-03 00:03:00 154.00
2013-01-03 00:04:00 153.00
... ...
2013-04-30 23:55:00 NaN
2013-04-30 23:56:00 182.00
2013-04-30 23:57:00 181.00
2013-04-30 23:58:00 182.00
2013-04-30 23:59:00 182.00

At the moment I have this line of code:

df['mains_1'] = (df
    .groupby((df.index.dayofweek * 24) + (df.index.hour) + (df.index.minute / 60))
    .transform(lambda x: x.fillna(x.mean()))
)

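What this does is fill each gap with the mean usage for the same day of the week and time of day across the whole dataset. I would like it to be more precise and use the mean of only the last two or three weeks.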

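You can concat the Series with shifted copies of itself, built in a loop; index alignment ensures that each row is matched with the values at the same time in the previous weeks. Then take the row-wise mean and use .fillna to update the original.

Sample data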

import pandas as pd
import numpy as np

np.random.seed(5)
df = pd.DataFrame(index=pd.date_range('2010-01-01 10:00:00', freq='W', periods=10),
                  data = np.random.choice([1,2,3,4, np.nan], 10),  # np.NaN was removed in NumPy 2.0
                  columns=['mains_1'])
#                     mains_1
#2010-01-03 10:00:00      4.0
#2010-01-10 10:00:00      1.0
#2010-01-17 10:00:00      2.0
#2010-01-24 10:00:00      1.0
#2010-01-31 10:00:00      NaN
#2010-02-07 10:00:00      4.0
#2010-02-14 10:00:00      1.0
#2010-02-21 10:00:00      1.0
#2010-02-28 10:00:00      NaN
#2010-03-07 10:00:00      2.0

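Code

# range(4) gives the current value plus the previous 3 weeks. The current value
# is NaN wherever a gap is filled, so the row mean there uses only earlier weeks.
# freq='7D' moves each timestamp by exactly one week; freq='W' is anchored to
# Sundays and only lines up when the index itself falls on Sundays, as it does here.
df1 = pd.concat([df.shift(periods=x, freq='7D') for x in range(4)], axis=1)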
#                     mains_1  mains_1  mains_1  mains_1
#2010-01-03 10:00:00      4.0      NaN      NaN      NaN
#2010-01-10 10:00:00      1.0      4.0      NaN      NaN
#2010-01-17 10:00:00      2.0      1.0      4.0      NaN
#2010-01-24 10:00:00      1.0      2.0      1.0      4.0
#2010-01-31 10:00:00      NaN      1.0      2.0      1.0
#2010-02-07 10:00:00      4.0      NaN      1.0      2.0
#2010-02-14 10:00:00      1.0      4.0      NaN      1.0
#2010-02-21 10:00:00      1.0      1.0      4.0      NaN
#2010-02-28 10:00:00      NaN      1.0      1.0      4.0
#2010-03-07 10:00:00      2.0      NaN      1.0      1.0
#2010-03-14 10:00:00      NaN      2.0      NaN      1.0
#2010-03-21 10:00:00      NaN      NaN      2.0      NaN
#2010-03-28 10:00:00      NaN      NaN      NaN      2.0

df['mains_1'] = df['mains_1'].fillna(df1.mean(axis=1))

print(df)

                      mains_1
2010-01-03 10:00:00  4.000000
2010-01-10 10:00:00  1.000000
2010-01-17 10:00:00  2.000000
2010-01-24 10:00:00  1.000000
2010-01-31 10:00:00  1.333333
2010-02-07 10:00:00  4.000000
2010-02-14 10:00:00  1.000000
2010-02-21 10:00:00  1.000000
2010-02-28 10:00:00  2.000000
2010-03-07 10:00:00  2.000000
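
The same idea can be applied to minute-level data like that in the question. The following is a minimal sketch, not a definitive implementation: `fill_from_previous_periods` is a hypothetical helper name, the synthetic `usage` series stands in for the real `mains_1` column, and it assumes the index is a regular `DatetimeIndex`. It shifts by a fixed `'7D'` so that weekday and time of day are preserved for any starting date.

```python
import numpy as np
import pandas as pd


def fill_from_previous_periods(s, n_periods=3, step="7D"):
    """Fill NaNs in `s` with the mean of the values 1..n_periods steps earlier.

    Hypothetical helper for illustration; not part of pandas.
    """
    # Shift copies of the series forward in time. Index alignment in concat
    # lines each timestamp up with its own earlier values.
    lagged = pd.concat(
        [s.shift(periods=k, freq=step) for k in range(1, n_periods + 1)],
        axis=1,
    )
    # mean(axis=1) skips NaNs, so a week that is itself missing is left out.
    return s.fillna(lagged.mean(axis=1))


# Four weeks of synthetic minute-level readings (value = minute of the day).
idx = pd.date_range("2013-01-03", periods=4 * 7 * 1440, freq="min")
usage = pd.Series(np.arange(len(idx), dtype=float) % 1440, index=idx, name="mains_1")

# Remove one reading in the last week and the same minute one week earlier.
target = pd.Timestamp("2013-01-24 00:02:00")
usage.loc[target] = np.nan
usage.loc[target - pd.Timedelta("7D")] = np.nan

filled = fill_from_previous_periods(usage, n_periods=3)
print(filled.loc[target])
```

Passing `step="1D"` would instead use the same time on the previous days. Gaps that have no earlier data within `n_periods` steps (for example in the first week of the series) stay NaN and need a separate fallback.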