Change the frequency of a time series in a Pandas Series with a tolerance, keeping the original dates and removing duplicated new-frequency dates with the same values

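I have time series data that was measured at varying frequencies. I want to convert it to a frequency of roughly a fixed number of days. The resulting time series may be irregular.

For example, I have this time series: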

Date Value
2017-02-16 26.17000
2017-02-27 26.28000
2017-03-13 26.30000
2017-03-29 26.23000
2017-04-14 26.19000
2017-04-26 26.06000
2017-05-13 26.06000
2017-05-27 25.65000
2017-06-16 25.29000
2017-07-05 25.25000
2017-07-14 25.48000
2017-07-26 25.57000
2017-08-17 25.16000
2017-08-28 25.33000
2017-09-12 25.68235
2017-09-13 25.83799
2017-09-14 25.76669
2017-09-15 25.85253
2017-09-16 25.82017
2017-09-17 25.78362
2017-09-18 25.88422
2017-09-19 25.89594
2017-09-20 25.85522
2017-09-21 25.83583
2017-09-22 25.80082
2017-09-23 25.80076
2017-09-24 25.79209
2017-09-25 25.80632
2017-09-26 25.77773
2017-09-27 25.76311

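At first the measurements were taken roughly every 14 days; later they were taken daily. I want to change the whole series to a frequency of roughly 14 days, but I want to keep the original dates.

I tried this: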

serie.reindex(index=serie.asfreq('14d').index,method='nearest',tolerance=datetime.timedelta(3))

This is the result I got:

Date Value
2017-02-16 26.17000
2017-03-02 26.28000
2017-03-16 26.30000
2017-03-30 26.23000
2017-04-13 26.19000
2017-04-27 26.06000
2017-05-11 26.06000
2017-05-25 25.65000
2017-06-08 NaN
2017-06-22 NaN
2017-07-06 25.25000
2017-07-20 NaN
2017-08-03 NaN
2017-08-17 25.16000
2017-08-31 25.33000
2017-09-14 25.76669
2017-09-28 25.73150

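This is more or less what I want: the values in the Value column are the ones I am looking for. However, I want each value to carry its original date rather than the date from the 14-day grid. How can I do this? Thank you very much! This is the result I want: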

Date Value
2017-02-16 26.17000
2017-02-27 26.28000
2017-03-13 26.30000
2017-03-29 26.23000
2017-04-14 26.19000
2017-04-26 26.06000
2017-05-13 26.06000
2017-05-27 25.65000
2017-06-08 NaN
2017-06-22 NaN
2017-07-05 25.25000
2017-07-20 NaN
2017-08-03 NaN
2017-08-17 25.16000
2017-08-28 25.33000
2017-09-14 25.76669
2017-09-28 25.73150

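We can build an intermediate working dataframe that contains both the reindexed rows and the original rows, which makes it easy to copy dates from the old index onto dates in the new index. Then we filter the rows and copy the dates for the selected index entries.

Step 1: Build a dataframe containing the reindexed rows and the original rows:

We can use Index.union to get the union of the reindexed index and the original index, as follows: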

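import datetime

import numpy as np
import pandas as pd

idx_new = serie.asfreq('14d').index   # regular 14-day grid
idx_old = serie.index                 # original measurement dates
idx_all = idx_new.union(idx_old)

tolerance = 3   # days

serie_all = serie.reindex(index=idx_all, method='nearest', tolerance=datetime.timedelta(tolerance))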
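
To make Step 1 more concrete, here is a minimal sketch on the first three rows of the question's data. The names s and s_all are made up for this illustration, and a Series is used instead of the dataframe only to keep it short. It shows that the union contains both the grid dates and the original dates, so a grid date ends up next to the original date it borrowed its value from.

```python
import pandas as pd

# First three measurements from the question (illustration only)
idx_old = pd.to_datetime(["2017-02-16", "2017-02-27", "2017-03-13"])
s = pd.Series([26.17, 26.28, 26.30], index=idx_old)

# Regular 14-day grid spanning the data
idx_new = s.asfreq("14D").index

# Sorted union of grid dates and original dates, without duplicates
idx_all = idx_new.union(idx_old)

# Every date takes the nearest original value, if one lies within 3 days
s_all = s.reindex(idx_all, method="nearest", tolerance=pd.Timedelta(days=3))
print(s_all)
```

In this sketch the grid date 2017-03-02 sits directly after the original date 2017-02-27 and carries the same value, which is exactly the pattern Step 2 looks for.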

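Step 2: Filter the rows and copy the dates for the selected index entries:

Let's use numpy.select() to evaluate several conditions in order. Then use .loc to keep only the rows whose index is not NaN/NaT.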
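
In case numpy.select() is unfamiliar, here is a small toy sketch, unrelated to the question's data, of the property this answer relies on: for each element, the first condition that is true decides which choice is taken, so the order of the conditions matters. The array x and the choices are invented for the illustration.

```python
import numpy as np

x = np.array([1, 5, 10, 15])

# Conditions are checked in order; the first True one wins for each element
condlist = [x < 3, x < 12]
choicelist = [x * 10, x * 100]

# Elements matching no condition get the default
out = np.select(condlist, choicelist, default=-1)
print(out)  # [  10  500 1000   -1]
```

The element 1 satisfies both conditions but takes the first choice, which is why the "discard" condition can safely come first in the list of criteria.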

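Filter criteria:

  • For dates that are not in the new index, mask them to NaT so they can be discarded
  • For a date whose previous entry is in the original index, has the same Value, and is no more than the tolerance (3 days) away ==> change the reindexed date to the previous entry's date
  • Check the immediately following entry in the same way ==> change the reindexed date to the following entry's date
  • Otherwise, keep the new reindexed date
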
dates = serie_all.index.to_series()
values = serie_all['Value']

condlist = [
    # 1. Not a date on the new grid: mark for removal
    ~serie_all.index.isin(idx_new),
    # 2. Previous row is an original date with the same value, within tolerance
    dates.shift().isin(idx_old) & values.eq(values.shift()) & dates.diff().dt.days.le(tolerance),
    # 3. Next row is an original date with the same value, within tolerance
    dates.shift(-1).isin(idx_old) & values.eq(values.shift(-1)) & dates.diff(-1).dt.days.abs().le(tolerance),
    # 4. Otherwise keep the grid date
    True,
]

choicelist = [
    pd.NaT,
    dates.shift(),
    dates.shift(-1),
    serie_all.index,
]

# Change date index values based on conditions
serie_all.index = pd.to_datetime(np.select(condlist, choicelist))

# Keep only non-NaT rows
serie_final = serie_all.loc[serie_all.index.notna()].rename_axis(index='Date')

Result:

print(serie_final)


               Value
Date                
2017-02-16  26.17000
2017-02-27  26.28000
2017-03-13  26.30000
2017-03-29  26.23000
2017-04-14  26.19000
2017-04-26  26.06000
2017-05-13  26.06000
2017-05-27  25.65000
2017-06-08       NaN
2017-06-22       NaN
2017-07-05  25.25000
2017-07-20       NaN
2017-08-03       NaN
2017-08-17  25.16000
2017-08-28  25.33000
2017-09-14  25.76669
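
As a possible alternative (a different technique from the one above, not a restatement of it), pd.merge_asof can match each grid date to the nearest original row within a tolerance while keeping the original date as a column. The sketch below uses the first ten rows of the question's data; the names grid, matched and result, and the column name Grid, are made up for the illustration. It assumes that falling back to the grid date is the desired behaviour when no original date is close enough.

```python
import pandas as pd

# First ten measurements from the question (illustration only)
serie = pd.DataFrame(
    {"Value": [26.17, 26.28, 26.30, 26.23, 26.19,
               26.06, 26.06, 25.65, 25.29, 25.25]},
    index=pd.to_datetime([
        "2017-02-16", "2017-02-27", "2017-03-13", "2017-03-29", "2017-04-14",
        "2017-04-26", "2017-05-13", "2017-05-27", "2017-06-16", "2017-07-05",
    ]),
).rename_axis(index="Date")

# Regular 14-day grid, as a column so it can serve as the left merge key
grid = pd.DataFrame({"Grid": serie.asfreq("14D").index})

# For each grid date, pick the nearest original row within 3 days;
# unmatched grid dates get NaT in Date and NaN in Value
matched = pd.merge_asof(
    grid,
    serie.reset_index(),
    left_on="Grid",
    right_on="Date",
    direction="nearest",
    tolerance=pd.Timedelta(days=3),
)

# Use the original date where one matched, otherwise keep the grid date
matched["Date"] = matched["Date"].fillna(matched["Grid"])
result = matched.set_index("Date")["Value"]
print(result)
```

This avoids comparing values between neighbouring rows, so it does not depend on two adjacent measurements having different values. Both inputs to merge_asof need to be sorted by their key columns, which holds here because the index is already in date order.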

Data setup

data = {'Value': {pd.Timestamp('2017-02-16 00:00:00'): 26.17,
  pd.Timestamp('2017-02-27 00:00:00'): 26.28,
  pd.Timestamp('2017-03-13 00:00:00'): 26.3,
  pd.Timestamp('2017-03-29 00:00:00'): 26.23,
  pd.Timestamp('2017-04-14 00:00:00'): 26.19,
  pd.Timestamp('2017-04-26 00:00:00'): 26.06,
  pd.Timestamp('2017-05-13 00:00:00'): 26.06,
  pd.Timestamp('2017-05-27 00:00:00'): 25.65,
  pd.Timestamp('2017-06-16 00:00:00'): 25.29,
  pd.Timestamp('2017-07-05 00:00:00'): 25.25,
  pd.Timestamp('2017-07-14 00:00:00'): 25.48,
  pd.Timestamp('2017-07-26 00:00:00'): 25.57,
  pd.Timestamp('2017-08-17 00:00:00'): 25.16,
  pd.Timestamp('2017-08-28 00:00:00'): 25.33,
  pd.Timestamp('2017-09-12 00:00:00'): 25.68235,
  pd.Timestamp('2017-09-13 00:00:00'): 25.83799,
  pd.Timestamp('2017-09-14 00:00:00'): 25.76669,
  pd.Timestamp('2017-09-15 00:00:00'): 25.85253,
  pd.Timestamp('2017-09-16 00:00:00'): 25.82017,
  pd.Timestamp('2017-09-17 00:00:00'): 25.78362,
  pd.Timestamp('2017-09-18 00:00:00'): 25.88422,
  pd.Timestamp('2017-09-19 00:00:00'): 25.89594,
  pd.Timestamp('2017-09-20 00:00:00'): 25.85522,
  pd.Timestamp('2017-09-21 00:00:00'): 25.83583,
  pd.Timestamp('2017-09-22 00:00:00'): 25.80082,
  pd.Timestamp('2017-09-23 00:00:00'): 25.80076,
  pd.Timestamp('2017-09-24 00:00:00'): 25.79209,
  pd.Timestamp('2017-09-25 00:00:00'): 25.80632,
  pd.Timestamp('2017-09-26 00:00:00'): 25.77773,
  pd.Timestamp('2017-09-27 00:00:00'): 25.76311}}  
  
serie = pd.DataFrame(data).rename_axis(index='Date')