How to apply a recursive digital filter in a pandas dataframe?
I have a dataframe like this:
import pandas as pd

days1 = pd.date_range('2020-01-01 01:00:00', '2020-01-01 01:19:00', freq='60s')
DF = pd.DataFrame({'Time': days1,
                   'TimeSeries1': [10, 10, 10, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20],
                   'TimeSeries2': [11, 12, 13, 12, 11, 14, 15, 16, 21, 20, 20, 23, 15, 15, 15, 15, 15, 15, 15, 15]})
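I would like to get the following:
- For each TimeSeries column (TimeSeries1 and TimeSeries2), I want to create a corresponding "_Filtered" column, i.e.:
TimeSeries1_Filtered[i] = (1-A)* TimeSeries1_Filtered[i-1] + A*TimeSeries1[i]
"A" is a filter factor between 0 and 1.
I need a different "A" factor for each column, e.g. A1=0.5 for TimeSeries1 and A2=0.8 for TimeSeries2.
I have more than 100 "TimeSeriesN" columns, so ideally the "A#" parameters would be passed as a tuple or list.
Example with A1=0.5: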
Time TimeSeries1 TimeSeries1_Filtered
0 2020-01-01 01:00:00 10 10
1 2020-01-01 01:01:00 10 10
2 2020-01-01 01:02:00 10 10
3 2020-01-01 01:03:00 20 15
4 2020-01-01 01:04:00 20 17.5
5 2020-01-01 01:05:00 20 18.75
6 2020-01-01 01:06:00 20 19.375
7 2020-01-01 01:07:00 20 19.6875
8 2020-01-01 01:08:00 20 19.84375
9 2020-01-01 01:09:00 20 19.92188
10 2020-01-01 01:10:00 20 19.96094
11 ... ... ...
Thanks!
Edit: corrected the filter notation and equation. Thanks to @not_speshal for pointing this out.
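For the n-th data point, the recursive formula expands to:
filtered[n] = A*(x[n] + (1-A)*x[n-1] + (1-A)**2 * x[n-2] + ... + (1-A)**(n-1) * x[1]) + (1-A)**n * x[0]
You can now write a custom function that returns the above and apply it to your dataframe: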
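import numpy as np

def ts_filter(srs, A):
    # Evaluate the expanded sum over each expanding window (oldest sample gets the largest exponent)
    return srs.expanding().apply(lambda x: A*(x*((1-A)**np.arange(len(x))[::-1])).sum() + (1-A)**x.size*x.iat[0])

# One filter factor per column, keyed by column name
factors = {"TimeSeries1": 0.5, "TimeSeries2": 0.2}
filtered = DF.filter(like="TimeSeries").apply(lambda x: ts_filter(x, A=factors[x.name]))
output = DF.join(filtered, rsuffix="_filtered")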
Output:
>>> output
Time TimeSeries1 ... TimeSeries1_filtered TimeSeries2_filtered
0 2020-01-01 01:00:00 10 ... 10.000000 11.000000
1 2020-01-01 01:01:00 10 ... 10.000000 11.200000
2 2020-01-01 01:02:00 10 ... 10.000000 11.560000
3 2020-01-01 01:03:00 20 ... 15.000000 11.648000
4 2020-01-01 01:04:00 20 ... 17.500000 11.518400
5 2020-01-01 01:05:00 20 ... 18.750000 12.014720
6 2020-01-01 01:06:00 20 ... 19.375000 12.611776
7 2020-01-01 01:07:00 20 ... 19.687500 13.289421
8 2020-01-01 01:08:00 20 ... 19.843750 14.831537
9 2020-01-01 01:09:00 20 ... 19.921875 15.865229
10 2020-01-01 01:10:00 20 ... 19.960938 16.692183
11 2020-01-01 01:11:00 20 ... 19.980469 17.953747
12 2020-01-01 01:12:00 20 ... 19.990234 17.362997
13 2020-01-01 01:13:00 20 ... 19.995117 16.890398
14 2020-01-01 01:14:00 20 ... 19.997559 16.512318
15 2020-01-01 01:15:00 20 ... 19.998779 16.209855
16 2020-01-01 01:16:00 20 ... 19.999390 15.967884
17 2020-01-01 01:17:00 20 ... 19.999695 15.774307
18 2020-01-01 01:18:00 20 ... 19.999847 15.619446
19 2020-01-01 01:19:00 20 ... 19.999924 15.495556
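As a sanity check on the expansion above, the following sketch compares a direct loop over the recursion with the expanded sum on the first few TimeSeries1 values. The helper names `recursive_filter` and `closed_form` are hypothetical and used only for illustration; both assume the filter starts at `y[0] = x[0]`, as in the question's example table.

```python
import numpy as np

def recursive_filter(values, A):
    """Direct loop: y[0] = x[0], then y[i] = (1 - A) * y[i-1] + A * x[i]."""
    x = np.asarray(values, dtype=float)
    out = np.empty(len(x))
    out[0] = x[0]
    for i in range(1, len(x)):
        out[i] = (1 - A) * out[i - 1] + A * x[i]
    return out

def closed_form(values, A):
    """Expanded sum, evaluated independently for every n."""
    x = np.asarray(values, dtype=float)
    out = np.empty(len(x))
    for n in range(len(x)):
        k = np.arange(n)  # exponents 0..n-1, paired with x[n], x[n-1], ..., x[1]
        out[n] = A * np.sum((1 - A) ** k * x[n - k]) + (1 - A) ** n * x[0]
    return out

x = [10, 10, 10, 20, 20, 20]
loop_result = recursive_filter(x, 0.5)
sum_result = closed_form(x, 0.5)
print(np.allclose(loop_result, sum_result))
```

The expanded sum recomputes the whole history for every point, so its cost grows quadratically with the series length, while the loop is linear.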
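Why not use a time-series filtering package such as scipy.signal?
Here is how I would do the filtering with scipy.signal.lfilter:
(Thanks to @not_speshal for pointing out the error in the OP's difference equation.)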
from scipy.signal import lfilter

coeffs = {'TimeSeries1': 0.5, 'TimeSeries2': 0.8}
for label, a in coeffs.items():
    # y[i] = a*x[i] + (1-a)*y[i-1]  ->  numerator [a], denominator [1, a-1]
    DF[f"{label}_Filtered"] = lfilter([a], [1, a-1], DF[label])
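However, you appear to assume an initial condition in which each filter is already at steady state at time i=0. This solution produces the result you want: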
from scipy.signal import lfilter, lfiltic

coeffs = {'TimeSeries1': 0.5, 'TimeSeries2': 0.8}
for label, a in coeffs.items():
    y_prev = DF[label].iloc[0]  # previous filtered value
    zi = lfiltic([a], [1, a-1], [y_prev])  # initial condition
    DF[f"{label}_Filtered"] = lfilter([a], [1, a-1], DF[label], zi=zi)[0]
print(DF)
Output:
Time TimeSeries1 TimeSeries2 TimeSeries1_Filtered TimeSeries2_Filtered
0 2020-01-01 01:00:00 10 11 10.000000 11.000000
1 2020-01-01 01:01:00 10 12 10.000000 11.800000
2 2020-01-01 01:02:00 10 13 10.000000 12.760000
3 2020-01-01 01:03:00 20 12 15.000000 12.152000
4 2020-01-01 01:04:00 20 11 17.500000 11.230400
5 2020-01-01 01:05:00 20 14 18.750000 13.446080
...
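To illustrate why the initial condition matters, here is a small sketch on the first six TimeSeries1 values with A=0.5. It compares `lfilter` started from rest (its default when `zi` is not given) with the `lfiltic`-based steady-state start used above. The variable names are illustrative only.

```python
import numpy as np
from scipy.signal import lfilter, lfiltic

x = np.array([10, 10, 10, 20, 20, 20], dtype=float)
a = 0.5
b_coef, a_coef = [a], [1, a - 1]  # y[i] = a*x[i] + (1-a)*y[i-1]

# Default: the filter starts from rest, i.e. the previous output is taken as 0
y_zero = lfilter(b_coef, a_coef, x)

# Steady-state start: pretend the previous output already equals x[0]
zi = lfiltic(b_coef, a_coef, [x[0]])
y_steady, _ = lfilter(b_coef, a_coef, x, zi=zi)

print(y_zero)    # ramps up from a*x[0]
print(y_steady)  # starts at x[0]
```

With the default zero state the first output is `a*x[0]` rather than `x[0]`, which is why the plain `lfilter` call does not reproduce the table in the question.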
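I just realized that the autoregressive filter you are using is equivalent to an exponentially weighted moving average filter, which already exists in pandas. You only need to turn off the adjustment factor that is normally applied during the first few sample periods.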
coeffs = {'TimeSeries1': 0.5, 'TimeSeries2': 0.8}
for label, a in coeffs.items():
    # adjust=False gives y[0] = x[0], y[i] = (1-a)*y[i-1] + a*x[i]
    DF[f"{label}_Filtered"] = DF[label].ewm(alpha=a, adjust=False).mean()
print(DF)
Output:
Time TimeSeries1 TimeSeries2 TimeSeries1_Filtered TimeSeries2_Filtered
0 2020-01-01 01:00:00 10 11 10.000000 11.000000
1 2020-01-01 01:01:00 10 12 10.000000 11.800000
2 2020-01-01 01:02:00 10 13 10.000000 12.760000
3 2020-01-01 01:03:00 20 12 15.000000 12.152000
4 2020-01-01 01:04:00 20 11 17.500000 11.230400
5 2020-01-01 01:05:00 20 14 18.750000 13.446080
...
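Since the question mentions more than 100 columns and passing the factors as a tuple or list, the same `ewm` call could be wrapped as in the sketch below. The helper `ewm_filter_columns` is a hypothetical name, not a pandas function, and it assumes the list of factors is given in the same order as the list of column names.

```python
import pandas as pd

def ewm_filter_columns(frame, columns, alphas):
    """Add a '<col>_Filtered' column for each (column, alpha) pair."""
    if len(columns) != len(alphas):
        raise ValueError("need exactly one alpha per column")
    out = frame.copy()  # leave the caller's dataframe untouched
    for col, alpha in zip(columns, alphas):
        out[f"{col}_Filtered"] = out[col].ewm(alpha=alpha, adjust=False).mean()
    return out

# Small demo frame holding the first six rows of the question's data
demo = pd.DataFrame({
    "TimeSeries1": [10, 10, 10, 20, 20, 20],
    "TimeSeries2": [11, 12, 13, 12, 11, 14],
})
ts_cols = [c for c in demo.columns if c.startswith("TimeSeries")]
result = ewm_filter_columns(demo, ts_cols, [0.5, 0.8])
print(result)
```

A dict keyed by column name, as in the loops above, is less error-prone when the column order may change; the list form is shown only because the question asks for it.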