How to impute missing values in time series data with the value from the same day and time in the previous week (or day) in Python

Question

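I have a DataFrame with a timestamp index and an energy-usage column. A reading is taken every minute of the day, so there are 1440 readings per day. A few values in the DataFrame are missing.

I want to impute those missing values with the average of the readings taken on the same day of the week at the same time over the previous two or three weeks. That way, if the value from the previous week is also missing, I can still use the value from two weeks ago.

Here is a sample of the data: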

                    mains_1
timestamp   
2013-01-03 00:00:00 155.00
2013-01-03 00:01:00 154.00
2013-01-03 00:02:00 NaN
2013-01-03 00:03:00 154.00
2013-01-03 00:04:00 153.00
... ...
2013-04-30 23:55:00 NaN
2013-04-30 23:56:00 182.00
2013-04-30 23:57:00 181.00
2013-04-30 23:58:00 182.00
2013-04-30 23:59:00 182.00

Right now I have this code:

df['mains_1'] = (df
    .groupby((df.index.dayofweek * 24) + (df.index.hour) + (df.index.minute / 60))
    .transform(lambda x: x.fillna(x.mean()))
)

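What this does is fill each gap with the average usage, over the whole dataset, for the same day of the week and time of day. I want it to be more precise and use only the average of the last two or three weeks.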
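For reference, the grouping key in the snippet above takes a distinct value for every combination of day of week, hour, and minute, so each group collects one reading per week of data. Below is a minimal sketch of an equivalent grouping with more explicit integer keys. The toy data and the names minute_of_day, keys, and filled are assumptions made up for illustration; they are not part of the original post.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three weeks of per-minute readings with one gap.
idx = pd.date_range("2013-01-03", periods=3 * 7 * 1440, freq="min")
df = pd.DataFrame({"mains_1": np.arange(len(idx), dtype=float)}, index=idx)
gap_pos = 14 * 1440 + 2           # third week, 00:02 on the first day
df.iloc[gap_pos, 0] = np.nan

# One group per (day of week, minute of day): 7 * 1440 groups in total.
minute_of_day = df.index.hour * 60 + df.index.minute
keys = [df.index.dayofweek, minute_of_day]

# The group mean skips NaN, so the gap is filled with the average of the
# readings at the same weekday and minute in the other weeks.
group_mean = df.groupby(keys)["mains_1"].transform("mean")
filled = df["mains_1"].fillna(group_mean)
```

Like the original snippet, this averages over every week in the dataset, which is exactly the limitation described above.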

Answer 1

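You can concat the Series with copies of itself shifted by whole weeks (calling shift in a loop). Index alignment ensures that each row is matched with the values at the same time in the previous weeks. Then take the row-wise mean and use .fillna to update the original column.

Sample data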

import pandas as pd
import numpy as np

np.random.seed(5)
df = pd.DataFrame(index=pd.date_range('2010-01-01 10:00:00', freq='W', periods=10),
                  data = np.random.choice([1,2,3,4, np.nan], 10),
                  columns=['mains_1'])
#                     mains_1
#2010-01-03 10:00:00      4.0
#2010-01-10 10:00:00      1.0
#2010-01-17 10:00:00      2.0
#2010-01-24 10:00:00      1.0
#2010-01-31 10:00:00      NaN
#2010-02-07 10:00:00      4.0
#2010-02-14 10:00:00      1.0
#2010-02-21 10:00:00      1.0
#2010-02-28 10:00:00      NaN
#2010-03-07 10:00:00      2.0

Code

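# range(4): x=0 is the original column, x=1..3 are the previous 3 weeks.
# freq='7D' shifts by exactly 7 days; freq='W' is anchored to Sundays and only
# lines up when every timestamp already falls on a Sunday, as in this sample.
df1 = pd.concat([df.shift(periods=x, freq='7D') for x in range(4)], axis=1)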
#                     mains_1  mains_1  mains_1  mains_1
#2010-01-03 10:00:00      4.0      NaN      NaN      NaN
#2010-01-10 10:00:00      1.0      4.0      NaN      NaN
#2010-01-17 10:00:00      2.0      1.0      4.0      NaN
#2010-01-24 10:00:00      1.0      2.0      1.0      4.0
#2010-01-31 10:00:00      NaN      1.0      2.0      1.0
#2010-02-07 10:00:00      4.0      NaN      1.0      2.0
#2010-02-14 10:00:00      1.0      4.0      NaN      1.0
#2010-02-21 10:00:00      1.0      1.0      4.0      NaN
#2010-02-28 10:00:00      NaN      1.0      1.0      4.0
#2010-03-07 10:00:00      2.0      NaN      1.0      1.0
#2010-03-14 10:00:00      NaN      2.0      NaN      1.0
#2010-03-21 10:00:00      NaN      NaN      2.0      NaN
#2010-03-28 10:00:00      NaN      NaN      NaN      2.0

df['mains_1'] = df['mains_1'].fillna(df1.mean(axis=1))

print(df)

                      mains_1
2010-01-03 10:00:00  4.000000
2010-01-10 10:00:00  1.000000
2010-01-17 10:00:00  2.000000
2010-01-24 10:00:00  1.000000
2010-01-31 10:00:00  1.333333
2010-02-07 10:00:00  4.000000
2010-02-14 10:00:00  1.000000
2010-02-21 10:00:00  1.000000
2010-02-28 10:00:00  2.000000
2010-03-07 10:00:00  2.000000
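The same idea carries over to per-minute data like that in the question. The following is a sketch under stated assumptions, not a definitive implementation: the function name impute_from_previous_weeks, its n_weeks parameter, and the toy Series are hypothetical and chosen only for illustration. It shifts by multiples of 7 days so that timestamps on any weekday line up, and it averages only the previous weeks.

```python
import numpy as np
import pandas as pd

def impute_from_previous_weeks(s, n_weeks=3):
    """Fill NaNs in s with the mean of the values 1..n_weeks weeks earlier."""
    # Shift by whole multiples of 7 days so each timestamp lines up with
    # the same weekday and time of day in earlier weeks.
    lagged = pd.concat(
        [s.shift(periods=k, freq="7D") for k in range(1, n_weeks + 1)],
        axis=1,
    )
    # mean(axis=1) skips NaN, so a missing week drops out of the average.
    return s.fillna(lagged.mean(axis=1))

# Hypothetical toy data: four weeks of per-minute readings.
idx = pd.date_range("2013-01-03", periods=4 * 7 * 1440, freq="min")
s = pd.Series(100.0, index=idx, name="mains_1")
week = 7 * 1440
s.iloc[0 * week + 5] = 10.0
s.iloc[1 * week + 5] = np.nan   # second week is missing as well
s.iloc[2 * week + 5] = 30.0
s.iloc[3 * week + 5] = np.nan   # the gap of interest

result = impute_from_previous_weeks(s, n_weeks=3)
```

On the question's data this would be called as df['mains_1'] = impute_from_previous_weeks(df['mains_1']). A gap stays NaN when all of the previous n_weeks readings are missing or fall before the start of the data. If the most recent available week is wanted instead of the mean, one option that should work is to replace lagged.mean(axis=1) with lagged.bfill(axis=1).iloc[:, 0], which takes the first non-missing value scanning from one week back.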

python

time-series

missing-data

pandas
