How to discretize time series with overspilling durations?
I am trying to discretize my dataframe, which looks like this:
| | Start Date | Park Duration (mins) | Charge Duration (mins) | Energy (kWh) |
|---|---|---|---|---|
| 49698 | 2016-01-01 11:48:00 | 230 | 92.0 | 3.034643 |
| 49710 | 2016-01-01 13:43:00 | 225 | 225.0 | 12.427662 |
| 49732 | 2016-01-01 22:43:00 | 708 | 111.0 | 10.752058 |
| 49736 | 2016-01-02 07:09:00 | 149 | 149.0 | 11.160776 |
| 49745 | 2016-01-02 10:29:00 | 156 | 156.0 | 10.298505 |
| 49758 | 2016-01-02 13:06:00 | 84 | 84.0 | 2.904127 |
| 49768 | 2016-01-02 15:00:00 | 27 | 26.0 | 2.573858 |
| 49773 | 2016-01-02 15:31:00 | 174 | 152.0 | 14.961943 |
| 49775 | 2016-01-02 16:01:00 | 195 | 167.0 | 16.317518 |
| 49790 | 2016-01-02 19:37:00 | 108 | 108.0 | 10.829344 |
| 49791 | 2016-01-02 19:56:00 | 289 | 26.0 | 2.552439 |
| 49802 | 2016-01-03 09:23:00 | 58 | 58.0 | 5.243358 |
| 49803 | 2016-01-03 09:33:00 | 264 | 134.0 | 6.782309 |
| 49813 | 2016-01-03 11:12:00 | 240 | 0.0 | 0.008115 |
| 49825 | 2016-01-03 14:12:00 | 97 | 96.0 | 5.29069 |
| 49833 | 2016-01-03 15:52:00 | 201 | 201.0 | 16.058235 |
| 49834 | 2016-01-03 15:52:00 | 53 | 52.0 | 5.304866 |
| 49840 | 2016-01-03 17:27:00 | 890 | 219.0 | 15.878921 |
| 49857 | 2016-01-04 05:57:00 | 198 | 127.0 | 6.368932 |
| 49871 | 2016-01-04 08:48:00 | 75 | 74.0 | 5.99877 |
What I want to do is resample it into 2-hour time slots, like this:
| Start Date | Energy (kWh) | Charge Duration (mins) | Fee |
|---|---|---|---|
| 2016-01-01 10:00:00 | 3.034643 | 92.0 | 0.0 |
| 2016-01-01 12:00:00 | 12.427662 | 225.0 | 0.0 |
| 2016-01-01 14:00:00 | 0.0 | 0.0 | 0.0 |
| 2016-01-01 16:00:00 | 0.0 | 0.0 | 0.0 |
| 2016-01-01 18:00:00 | 0.0 | 0.0 | 0.0 |
| 2016-01-01 20:00:00 | 0.0 | 0.0 | 0.0 |
| 2016-01-01 22:00:00 | 10.752058 | 111.0 | 0.0 |
| 2016-01-02 00:00:00 | 0.0 | 0.0 | 0.0 |
| 2016-01-02 02:00:00 | 0.0 | 0.0 | 0.0 |
| 2016-01-02 04:00:00 | 0.0 | 0.0 | 0.0 |
What I used is:
data.resample('2H', on='Start Date').agg({'Energy (kWh)': 'sum', 'Charge Duration (mins)': 'sum'})
But the problem is that the data overspills. As you can see from the first row, the charge duration is 92 minutes, yet only 12 of those 92 minutes fall within the 10:00:00 - 12:00:00 slot; resample nevertheless assigns the whole charge duration to that slot. The behaviour I want is to distribute the minutes "evenly" across the slots, based on Start Date and Charge Duration, so that 12 minutes land in the first slot and the remaining 80 minutes land in the next one. There are also cases where an EV charges across more than 3 slots. I hope this makes sense.
How would you go about this?
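To illustrate with the first row, here is a minimal sketch (plain pandas arithmetic, with the slots aligned to even hours) of the split I have in mind:
import pandas as pd

# first row: charging starts at 11:48 and lasts 92 minutes
start = pd.Timestamp("2016-01-01 11:48:00")
end = start + pd.Timedelta(minutes=92)
slot_end = start.floor("2H") + pd.Timedelta("2H")  # first slot is 10:00 - 12:00
in_first_slot = min(end, slot_end) - start  # 12 minutes
in_later_slots = end - slot_end             # the remaining 80 minutes
print(in_first_slot, in_later_slots)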
Here is the raw data as comma-separated values:
,Start Date,Park Duration (mins),Charge Duration (mins),Energy (kWh)
49698,2016-01-01 11:48:00, 230 ,92.0,3.034643
49710,2016-01-01 13:43:00, 225 ,225.0,12.427662
49732,2016-01-01 22:43:00, 708 ,111.0,10.752058
49736,2016-01-02 07:09:00, 149 ,149.0,11.160776
49745,2016-01-02 10:29:00, 156 ,156.0,10.298505
49758,2016-01-02 13:06:00, 84 ,84.0,2.904127
49768,2016-01-02 15:00:00, 27 ,26.0,2.573858
49773,2016-01-02 15:31:00, 174 ,152.0,14.961943
49775,2016-01-02 16:01:00, 195 ,167.0,16.317518
49790,2016-01-02 19:37:00, 108 ,108.0,10.829344
49791,2016-01-02 19:56:00, 289 ,26.0,2.552439
49802,2016-01-03 09:23:00, 58 ,58.0,5.243358
49803,2016-01-03 09:33:00, 264 ,134.0,6.782309
49813,2016-01-03 11:12:00, 240 ,0.0,0.008115
49825,2016-01-03 14:12:00, 97 ,96.0,5.29069
49833,2016-01-03 15:52:00, 201 ,201.0,16.058235
49834,2016-01-03 15:52:00, 53 ,52.0,5.304866
49840,2016-01-03 17:27:00, 890 ,219.0,15.878921
49857,2016-01-04 05:57:00, 198 ,127.0,6.368932
49871,2016-01-04 08:48:00, 75 ,74.0,5.99877
I don't see a completely straightforward way to do this. Effectively, I build a small dataframe for each row and use ratios to split its values across the target rows.
import io
import numpy as np
import pandas as pd

# read the comma-separated data from the question
df = pd.read_csv(io.StringIO(""",Start Date,Park Duration (mins),Charge Duration (mins),Energy (kWh)
49698,2016-01-01 11:48:00,230,92.0,3.034643
49710,2016-01-01 13:43:00,225,225.0,12.427662
49732,2016-01-01 22:43:00,708,111.0,10.752058
49736,2016-01-02 07:09:00,149,149.0,11.160776
49745,2016-01-02 10:29:00,156,156.0,10.298505
49758,2016-01-02 13:06:00,84,84.0,2.904127
49768,2016-01-02 15:00:00,27,26.0,2.573858
49773,2016-01-02 15:31:00,174,152.0,14.961943
49775,2016-01-02 16:01:00,195,167.0,16.317518
49790,2016-01-02 19:37:00,108,108.0,10.829344
49791,2016-01-02 19:56:00,289,26.0,2.552439
49802,2016-01-03 09:23:00,58,58.0,5.243358
49803,2016-01-03 09:33:00,264,134.0,6.782309
49813,2016-01-03 11:12:00,240,0.0,0.008115
49825,2016-01-03 14:12:00,97,96.0,5.29069
49833,2016-01-03 15:52:00,201,201.0,16.058235
49834,2016-01-03 15:52:00,53,52.0,5.304866
49840,2016-01-03 17:27:00,890,219.0,15.878921
49857,2016-01-04 05:57:00,198,127.0,6.368932
49871,2016-01-04 08:48:00,75,74.0,5.99877"""), index_col=0)
df["Start Date"] = pd.to_datetime(df["Start Date"])
def proportionalsplit(s, freq="2H"):
    st = s["Start Date"]
    et = st + pd.Timedelta(minutes=s["Charge Duration (mins)"])
    # bucket start times covering the whole charge, aligned to the freq grid
    tr = pd.date_range(st.floor(freq), et, freq=freq)
    lmin = {"2H": 120}  # bucket length in minutes per supported freq
    # minutes of the charge in each bucket: the first bucket runs from the
    # charge start to the next boundary (tr.shift() - st), the others from the
    # bucket start to the charge end (et - tr), capped at a full bucket;
    # dividing by 10**9 * 60 converts the nanosecond deltas to whole minutes
    ratio = np.minimum((np.where(tr < st, tr.shift() - st, et - tr) / (10**9 * 60)).astype(int),
                       np.full(len(tr), lmin[freq]))
    # ratio of how numeric values should be split across the new buckets
    ratio = ratio / ratio.sum()
    return {"Start Date": tr,
            "Original Duration": np.full(len(tr), s["Charge Duration (mins)"]),
            "Original Start": np.full(len(tr), s["Start Date"]),
            "Original Index": np.full(len(tr), s.name),
            "Charge Duration (mins)": s["Charge Duration (mins)"] * ratio,
            "Energy (kWh)": s["Energy (kWh)"] * ratio,
            }
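# quick check on a single row: the 92-minute charge starting at 11:48 should
# split into 12 minutes for the 10:00 bucket and 80 minutes for the 12:00 bucket
print(pd.DataFrame(proportionalsplit(df.iloc[0])))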
# build one small frame per original row, then stack them all
df2 = pd.concat([pd.DataFrame(v) for v in df.apply(proportionalsplit, axis=1).values]).reset_index(drop=True)
# everything OK?
print(df2["Energy (kWh)"].sum().round(3)==df["Energy (kWh)"].sum().round(3),
df2["Charge Duration (mins)"].sum().round(3)==df["Charge Duration (mins)"].sum().round(3),)
# let's have a look at everything in 2H resample...
df3 = df2.groupby(["Start Date"]).agg({**{c:lambda s: list(s) for c in df2.columns if "Original" in c},
**{c:"sum" for c in ["Charge Duration (mins)","Energy (kWh)"]}})
Output
Original Duration Original Start Original Index Charge Duration (mins) Energy (kWh)
Start Date
2016-01-01 10:00:00 [92.0] [2016-01-01 11:48:00] [49698] 12.0 0.395823
2016-01-01 12:00:00 [92.0, 225.0] [2016-01-01 11:48:00, 2016-01-01 13:43:00] [49698, 49710] 97.0 3.577799
2016-01-01 14:00:00 [225.0] [2016-01-01 13:43:00] [49710] 120.0 6.628086
2016-01-01 16:00:00 [225.0] [2016-01-01 13:43:00] [49710] 88.0 4.860597
2016-01-01 22:00:00 [111.0] [2016-01-01 22:43:00] [49732] 77.0 7.458635
2016-01-02 00:00:00 [111.0] [2016-01-01 22:43:00] [49732] 34.0 3.293423
2016-01-02 06:00:00 [149.0] [2016-01-02 07:09:00] [49736] 51.0 3.820131
2016-01-02 08:00:00 [149.0] [2016-01-02 07:09:00] [49736] 98.0 7.340645
2016-01-02 10:00:00 [156.0] [2016-01-02 10:29:00] [49745] 91.0 6.007461
2016-01-02 12:00:00 [156.0, 84.0] [2016-01-02 10:29:00, 2016-01-02 13:06:00] [49745, 49758] 119.0 6.157983
2016-01-02 14:00:00 [84.0, 26.0, 152.0] [2016-01-02 13:06:00, 2016-01-02 15:00:00, 2016-01-02 15:31:00] [49758, 49768, 49773] 85.0 6.465627
2016-01-02 16:00:00 [152.0, 167.0] [2016-01-02 15:31:00, 2016-01-02 16:01:00] [49773, 49775] 239.0 23.439513
2016-01-02 18:00:00 [152.0, 167.0, 108.0, 26.0] [2016-01-02 15:31:00, 2016-01-02 16:01:00, 2016-01-02 19:37:00, 2016-01-02 19:56:00] [49773, 49775, 49790, 49791] 78.0 7.684299
2016-01-02 20:00:00 [108.0, 26.0] [2016-01-02 19:37:00, 2016-01-02 19:56:00] [49790, 49791] 107.0 10.682851
2016-01-03 08:00:00 [58.0, 134.0] [2016-01-03 09:23:00, 2016-01-03 09:33:00] [49802, 49803] 64.0 4.711485
2016-01-03 10:00:00 [58.0, 134.0, 0.0] [2016-01-03 09:23:00, 2016-01-03 09:33:00, 2016-01-03 11:12:00] [49802, 49803, 49813] 128.0 7.322297
2016-01-03 14:00:00 [96.0, 201.0, 52.0] [2016-01-03 14:12:00, 2016-01-03 15:52:00, 2016-01-03 15:52:00] [49825, 49833, 49834] 112.0 6.745957
2016-01-03 16:00:00 [201.0, 52.0, 219.0] [2016-01-03 15:52:00, 2016-01-03 15:52:00, 2016-01-03 17:27:00] [49833, 49834, 49840] 197.0 16.468453
2016-01-03 18:00:00 [201.0, 219.0] [2016-01-03 15:52:00, 2016-01-03 17:27:00] [49833, 49840] 193.0 14.532874
2016-01-03 20:00:00 [219.0] [2016-01-03 17:27:00] [49840] 66.0 4.785428
2016-01-04 04:00:00 [127.0] [2016-01-04 05:57:00] [49857] 3.0 0.150447
2016-01-04 06:00:00 [127.0] [2016-01-04 05:57:00] [49857] 120.0 6.017889
2016-01-04 08:00:00 [127.0, 74.0] [2016-01-04 05:57:00, 2016-01-04 08:48:00] [49857, 49871] 76.0 6.037237
2016-01-04 10:00:00 [74.0] [2016-01-04 08:48:00] [49871] 2.0 0.162129
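Note that the groupby only emits slots that actually received some charge. If you also want the empty, zero-filled 2-hour slots shown in the question's desired output, one way (a sketch that reindexes just the numeric columns onto the full 2H grid) is:
grid = pd.date_range(df3.index.min(), df3.index.max(), freq="2H")
df4 = df3[["Charge Duration (mins)", "Energy (kWh)"]].reindex(grid, fill_value=0.0)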