Change the frequency of a time series in a Pandas Series with a tolerance, keeping the original dates and removing duplicated new-frequency dates that have the same values
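I have time series data with varying measurement frequencies. I want to convert the data to a frequency of roughly a fixed number of days. The resulting time series may be irregular.
For example, I have this time series: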
Date | Value |
---|---|
2017-02-16 | 26.17000 |
2017-02-27 | 26.28000 |
2017-03-13 | 26.30000 |
2017-03-29 | 26.23000 |
2017-04-14 | 26.19000 |
2017-04-26 | 26.06000 |
2017-05-13 | 26.06000 |
2017-05-27 | 25.65000 |
2017-06-16 | 25.29000 |
2017-07-05 | 25.25000 |
2017-07-14 | 25.48000 |
2017-07-26 | 25.57000 |
2017-08-17 | 25.16000 |
2017-08-28 | 25.33000 |
2017-09-12 | 25.68235 |
2017-09-13 | 25.83799 |
2017-09-14 | 25.76669 |
2017-09-15 | 25.85253 |
2017-09-16 | 25.82017 |
2017-09-17 | 25.78362 |
2017-09-18 | 25.88422 |
2017-09-19 | 25.89594 |
2017-09-20 | 25.85522 |
2017-09-21 | 25.83583 |
2017-09-22 | 25.80082 |
2017-09-23 | 25.80076 |
2017-09-24 | 25.79209 |
2017-09-25 | 25.80632 |
2017-09-26 | 25.77773 |
2017-09-27 | 25.76311 |
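At first, measurements were taken roughly every 14 days. Later, they were taken daily. I want to change the whole series to a frequency of about 14 days, but I want to keep the original dates.
I tried this:
serie.reindex(index=serie.asfreq('14d').index, method='nearest', tolerance=datetime.timedelta(days=3))
This is the result I got: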
Date | Value |
---|---|
2017-02-16 | 26.17000 |
2017-03-02 | 26.28000 |
2017-03-16 | 26.30000 |
2017-03-30 | 26.23000 |
2017-04-13 | 26.19000 |
2017-04-27 | 26.06000 |
2017-05-11 | 26.06000 |
2017-05-25 | 25.65000 |
2017-06-08 | NaN |
2017-06-22 | NaN |
2017-07-06 | 25.25000 |
2017-07-20 | NaN |
2017-08-03 | NaN |
2017-08-17 | 25.16000 |
2017-08-31 | 25.33000 |
2017-09-14 | 25.76669 |
2017-09-28 | 25.73150 |
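This is more or less what I want. The values in the Value column are the ones I am looking for. However, I want the original dates that correspond to those values. How can I do this? Thank you very much! This is the result I want: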
Date | Value |
---|---|
2017-02-16 | 26.17000 |
2017-02-27 | 26.28000 |
2017-03-13 | 26.30000 |
2017-03-29 | 26.23000 |
2017-04-14 | 26.19000 |
2017-04-26 | 26.06000 |
2017-05-13 | 26.06000 |
2017-05-27 | 25.65000 |
2017-06-08 | NaN |
2017-06-22 | NaN |
2017-07-05 | 25.25000 |
2017-07-20 | NaN |
2017-08-03 | NaN |
2017-08-17 | 25.16000 |
2017-08-28 | 25.33000 |
2017-09-14 | 25.76669 |
2017-09-28 | 25.73150 |
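We can build an intermediate working DataFrame that contains both the reindexed rows and the original rows, which makes it easy to copy dates from the old index onto dates in the new index. Then we filter the rows and copy the dates for the selected index entries.
Step 1: Build a DataFrame containing the reindexed rows and the original rows:
We can use Index.union to get the union of the reindexed index and the original index, as follows: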
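import datetime

import numpy as np
import pandas as pd

idx_new = serie.asfreq('14d').index    # regular 14-day grid
idx_old = serie.index                  # original measurement dates
idx_all = idx_new.union(idx_old)       # sorted union of both indexes
tolerance = 3                          # maximum distance, in days
serie_all = serie.reindex(index=idx_all, method='nearest', tolerance=datetime.timedelta(days=tolerance))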
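To see what the union plus nearest-neighbour reindexing does in isolation, here is a minimal sketch on made-up toy data. The series `s`, its dates, and the names `grid`, `both`, and `out` are assumptions for illustration and are not part of the question's data. It builds the regular grid with `pd.date_range`, which should give the same kind of grid as `asfreq('14d').index`, and passes the tolerance as a `pd.Timedelta`.

```python
import pandas as pd

# Toy data (made up for illustration): three irregular observations.
s = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.to_datetime(["2021-01-01", "2021-01-12", "2021-02-20"]),
)

# Regular 14-day grid spanning the observed range:
# 2021-01-01, 2021-01-15, 2021-01-29, 2021-02-12.
grid = pd.date_range(s.index[0], s.index[-1], freq="14D")

# The union keeps every label from both indexes, sorted and de-duplicated.
both = grid.union(s.index)

# Nearest-neighbour lookup: a label with no observation within 3 days gets NaN.
out = s.reindex(both, method="nearest", tolerance=pd.Timedelta(days=3))
print(out)
```

In this toy case the grid date 2021-01-15 picks up the value observed on 2021-01-12 (exactly 3 days away), while 2021-01-29 and 2021-02-12 have no observation close enough and become NaN. The rows that come only from the original index are the ones whose dates can later be copied onto the grid rows.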
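Step 2: Filter the rows and copy the dates for the selected index entries:
Let's use numpy.select() to evaluate multiple conditions. Then use .loc to keep only the rows whose index is not NaN/NaT.
Filtering conditions:
- For dates that are not in the new index, mask them to NaT so they are dropped.
- For dates whose previous date entry is in the original index, where column Value has the same value and the two dates differ by no more than the tolerance (3 days) ==> change the reindexed date to the previous date entry.
- Similarly, check the immediately following date entry ==> change the reindexed date to the following date entry.
- Otherwise, keep the new reindexed date.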
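The order of these conditions matters, because `numpy.select` uses the first condition that matches for each element. The sketch below shows that behaviour on made-up integers; the array `x` and the choices are hypothetical and only stand in for the date logic, and it uses the `default` argument as the catch-all rather than a final always-true condition.

```python
import numpy as np

# Toy values (made up): classify numbers with overlapping conditions.
x = np.array([1, 5, 12, 20])

condlist = [
    x < 3,    # checked first
    x < 10,   # also true for 1, but the first condition already claimed it
    x < 15,
]
choicelist = [
    np.full(x.shape, 0),  # one choice array per condition, same shape as x
    x * 10,
    x * 100,
]

# Elements that match no condition receive `default`.
result = np.select(condlist, choicelist, default=-1)
print(result.tolist())
```

Here the element 1 satisfies all three conditions but takes its value from the first choice, and 20 satisfies none and falls back to the default. The same rule is why the "not in the new index" condition is listed first among the filtering conditions: it takes priority over the neighbour checks.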
condlist = [
    # 1. Date is not on the new 14-day grid
    ~serie_all.index.isin(idx_new),
    # 2. Previous entry is an original date with the same value, within the tolerance
    serie_all.index.to_series().shift().isin(idx_old) & serie_all['Value'].eq(serie_all['Value'].shift()) & serie_all.index.to_series().diff().dt.days.le(tolerance),
    # 3. Following entry is an original date with the same value, within the tolerance
    serie_all.index.to_series().shift(-1).isin(idx_old) & serie_all['Value'].eq(serie_all['Value'].shift(-1)) & serie_all.index.to_series().diff(-1).dt.days.abs().le(tolerance),
    # 4. Catch-all
    True,
]
choicelist = [
    pd.NaT,                                  # 1. mark for removal
    serie_all.index.to_series().shift(),     # 2. take the previous date
    serie_all.index.to_series().shift(-1),   # 3. take the following date
    serie_all.index,                         # 4. keep the reindexed date
]
# Change date index values based on conditions
serie_all.index = pd.to_datetime(np.select(condlist, choicelist))
# Keep only non-NaT rows
serie_final = serie_all.loc[serie_all.index.notna()].rename_axis(index='Date')
Result:
print(serie_final)
               Value
Date
2017-02-16 26.17000
2017-02-27 26.28000
2017-03-13 26.30000
2017-03-29 26.23000
2017-04-14 26.19000
2017-04-26 26.06000
2017-05-13 26.06000
2017-05-27 25.65000
2017-06-08 NaN
2017-06-22 NaN
2017-07-05 25.25000
2017-07-20 NaN
2017-08-03 NaN
2017-08-17 25.16000
2017-08-28 25.33000
2017-09-14 25.76669
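As a cross-check, one possible alternative (not the method used above) is to ask the original index directly which observation is nearest to each grid date with `Index.get_indexer`, and then reuse that observation's own date. The sketch below runs on made-up toy data; the names `obs`, `grid`, `pos`, `matched`, and `result` are assumptions for illustration. It does not compare values, and it could produce repeated dates if the tolerance were large enough for two grid dates to match the same observation.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration).
obs = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.to_datetime(["2021-01-01", "2021-01-12", "2021-02-20"]),
)
grid = pd.date_range(obs.index[0], obs.index[-1], freq="14D")

# Position of the nearest observation for each grid date, or -1 if none lies within 3 days.
pos = obs.index.get_indexer(grid, method="nearest", tolerance=pd.Timedelta(days=3))
matched = pos >= 0

# Use the observation's own date where a match exists, otherwise keep the grid date.
dates = np.where(matched, obs.index[pos], grid)
# Use the observation's value where a match exists, otherwise NaN.
values = np.where(matched, obs.to_numpy()[pos], np.nan)

result = pd.Series(values, index=pd.DatetimeIndex(dates, name="Date"), name="Value")
print(result)
```

On this toy input the second grid date, 2021-01-15, is replaced by the original date 2021-01-12, and the two unmatched grid dates stay on the grid with NaN values.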
Data setup
import pandas as pd

data = {'Value': {pd.Timestamp('2017-02-16 00:00:00'): 26.17,
pd.Timestamp('2017-02-27 00:00:00'): 26.28,
pd.Timestamp('2017-03-13 00:00:00'): 26.3,
pd.Timestamp('2017-03-29 00:00:00'): 26.23,
pd.Timestamp('2017-04-14 00:00:00'): 26.19,
pd.Timestamp('2017-04-26 00:00:00'): 26.06,
pd.Timestamp('2017-05-13 00:00:00'): 26.06,
pd.Timestamp('2017-05-27 00:00:00'): 25.65,
pd.Timestamp('2017-06-16 00:00:00'): 25.29,
pd.Timestamp('2017-07-05 00:00:00'): 25.25,
pd.Timestamp('2017-07-14 00:00:00'): 25.48,
pd.Timestamp('2017-07-26 00:00:00'): 25.57,
pd.Timestamp('2017-08-17 00:00:00'): 25.16,
pd.Timestamp('2017-08-28 00:00:00'): 25.33,
pd.Timestamp('2017-09-12 00:00:00'): 25.68235,
pd.Timestamp('2017-09-13 00:00:00'): 25.83799,
pd.Timestamp('2017-09-14 00:00:00'): 25.76669,
pd.Timestamp('2017-09-15 00:00:00'): 25.85253,
pd.Timestamp('2017-09-16 00:00:00'): 25.82017,
pd.Timestamp('2017-09-17 00:00:00'): 25.78362,
pd.Timestamp('2017-09-18 00:00:00'): 25.88422,
pd.Timestamp('2017-09-19 00:00:00'): 25.89594,
pd.Timestamp('2017-09-20 00:00:00'): 25.85522,
pd.Timestamp('2017-09-21 00:00:00'): 25.83583,
pd.Timestamp('2017-09-22 00:00:00'): 25.80082,
pd.Timestamp('2017-09-23 00:00:00'): 25.80076,
pd.Timestamp('2017-09-24 00:00:00'): 25.79209,
pd.Timestamp('2017-09-25 00:00:00'): 25.80632,
pd.Timestamp('2017-09-26 00:00:00'): 25.77773,
pd.Timestamp('2017-09-27 00:00:00'): 25.76311}}
serie = pd.DataFrame(data).rename_axis(index='Date')