Aggregate efficiently between dates
Hi, I have a DataFrame that looks like this:
HostName Date
0 B 2021-01-01 12:42:00
1 B 2021-02-01 12:30:00
2 B 2021-02-01 12:40:00
3 B 2021-02-25 12:40:00
4 B 2021-03-01 12:41:00
5 B 2021-03-01 12:42:00
6 B 2021-03-02 12:43:00
7 B 2021-03-03 12:44:00
8 B 2021-04-04 12:44:00
9 B 2021-06-05 12:44:00
10 B 2021-08-06 12:44:00
11 B 2021-09-07 12:44:00
12 A 2021-03-12 12:45:00
13 A 2021-03-13 12:46:00
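This is how I currently do the aggregation (the code is shown after the expected result), but it is not efficient at all: with 1M rows it takes a very long time.
Is there a better way to aggregate efficiently between dates?
Final result: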
HostName Date ds
0 B 2021-01-01 12:42:00 1
1 B 2021-02-01 12:30:00 2
2 B 2021-02-01 12:40:00 3
3 B 2021-02-25 12:40:00 3
4 B 2021-03-01 12:41:00 2
5 B 2021-03-01 12:42:00 3
6 B 2021-03-02 12:43:00 4
7 B 2021-03-03 12:44:00 5
8 B 2021-04-04 12:44:00 1
9 B 2021-06-05 12:44:00 1
10 B 2021-08-06 12:44:00 1
11 B 2021-09-07 12:44:00 1
12 A 2021-03-12 12:45:00 1
13 A 2021-03-13 12:46:00 2
TheList = []
for index, row in df.iterrows():
    # rows whose Date falls in the month ending at this row's Date, counted per host
    in_window = df[(df['Date'] > row['Date'] - pd.DateOffset(months=1)) & (df['Date'] <= row['Date'])]
    TheList.append(in_window.groupby('HostName').size()[row['HostName']])
df['ds'] = TheList
Is there a better way to achieve the same result?
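Use broadcasting within each group and count the True values with sum, inside a custom function passed to GroupBy.transform:
Note: performance also depends on the size of the groups. The comparison builds an n x n boolean array per group, so a few very large groups can cause memory problems.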
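import numpy as np
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])

def f(x):
    a = x.to_numpy()                                # dates of one HostName group
    b = x.sub(pd.DateOffset(months=1)).to_numpy()   # window start: one month earlier
    # entry [i, j] is True when b[i] < a[j] <= a[i]; summing row i counts its window
    return np.sum((a > b[:, None]) & (a <= a[:, None]), axis=1)

df['ds'] = df.groupby('HostName')['Date'].transform(f)
print(df)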
HostName Date ds
0 B 2021-01-01 12:42:00 1
1 B 2021-02-01 12:30:00 2
2 B 2021-02-01 12:40:00 3
3 B 2021-02-25 12:40:00 3
4 B 2021-03-01 12:41:00 2
5 B 2021-03-01 12:42:00 3
6 B 2021-03-02 12:43:00 4
7 B 2021-03-03 12:44:00 5
8 B 2021-04-04 12:44:00 1
9 B 2021-06-05 12:44:00 1
10 B 2021-08-06 12:44:00 1
11 B 2021-09-07 12:44:00 1
12 A 2021-03-12 12:45:00 1
13 A 2021-03-13 12:46:00 2
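To make the broadcast comparison inside f easier to follow, here is a minimal sketch on three made-up dates. The dates and the fixed 30-day window are assumptions for illustration only (the answer itself uses pd.DateOffset(months=1)); the point is just to show the n x n mask and the row sums.

```python
import numpy as np

# Three illustrative dates (not from the question) and their lower bounds.
a = np.array(['2021-01-01', '2021-01-20', '2021-02-15'], dtype='datetime64[D]')
b = a - np.timedelta64(30, 'D')  # stand-in for "one month earlier"

# b[:, None] and a[:, None] are column vectors, so each comparison
# broadcasts to a 3 x 3 array: mask[i, j] is True when b[i] < a[j] <= a[i].
mask = (a > b[:, None]) & (a <= a[:, None])

# Summing along each row counts how many dates fall in row i's window.
counts = mask.sum(axis=1)
print(mask)
print(counts)
```

The mask has one entry per pair of rows in the group, which is why memory grows quadratically with group size.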
Unfortunately, if memory is a problem, a loop is necessary:
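df['Date'] = pd.to_datetime(df['Date'])
df['Date1'] = df['Date'].sub(pd.DateOffset(months=1))   # window start for each row

def f(x):
    one = x['Date'].to_numpy()
    both = x[['Date', 'Date1']].to_numpy()
    # one pass per row: count the group's dates with Date1 < date <= Date
    return x.assign(ds=[np.sum((one > b) & (one <= a)) for a, b in both])

# group_keys=False keeps the original row index instead of adding HostName to it
df = df.groupby('HostName', group_keys=False).apply(f)
print(df)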
HostName Date Date1 ds
0 B 2021-01-01 12:42:00 2020-12-01 12:42:00 1
1 B 2021-02-01 12:30:00 2021-01-01 12:30:00 2
2 B 2021-02-01 12:40:00 2021-01-01 12:40:00 3
3 B 2021-02-25 12:40:00 2021-01-25 12:40:00 3
4 B 2021-03-01 12:41:00 2021-02-01 12:41:00 2
5 B 2021-03-01 12:42:00 2021-02-01 12:42:00 3
6 B 2021-03-02 12:43:00 2021-02-02 12:43:00 4
7 B 2021-03-03 12:44:00 2021-02-03 12:44:00 5
8 B 2021-04-04 12:44:00 2021-03-04 12:44:00 1
9 B 2021-06-05 12:44:00 2021-05-05 12:44:00 1
10 B 2021-08-06 12:44:00 2021-07-06 12:44:00 1
11 B 2021-09-07 12:44:00 2021-08-07 12:44:00 1
12 A 2021-03-12 12:45:00 2021-02-12 12:45:00 1
13 A 2021-03-13 12:46:00 2021-02-13 12:46:00 2
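If both the n x n array and the Python loop are too costly, one possible alternative (not part of the original answer, just a sketch) is to sort each group and use np.searchsorted, which counts the dates up to a bound by binary search. The helper name count_in_window is hypothetical, and the sample frame is rebuilt from the question so the block runs on its own.

```python
import numpy as np
import pandas as pd

def count_in_window(x):
    """Per row, count the group's dates d with (date - 1 month) < d <= date."""
    values = x.to_numpy()
    order = np.argsort(values, kind='stable')
    a = values[order]                                      # sorted dates
    b = (x - pd.DateOffset(months=1)).to_numpy()[order]    # matching lower bounds
    # side='right' returns how many sorted dates are <= the probe value,
    # so the difference is the number of dates in (b[i], a[i]].
    counts = np.searchsorted(a, a, side='right') - np.searchsorted(a, b, side='right')
    out = np.empty(len(values), dtype=np.int64)
    out[order] = counts                                    # restore original row order
    return out

# Sample data copied from the question.
df = pd.DataFrame({
    'HostName': ['B'] * 12 + ['A'] * 2,
    'Date': pd.to_datetime([
        '2021-01-01 12:42:00', '2021-02-01 12:30:00', '2021-02-01 12:40:00',
        '2021-02-25 12:40:00', '2021-03-01 12:41:00', '2021-03-01 12:42:00',
        '2021-03-02 12:43:00', '2021-03-03 12:44:00', '2021-04-04 12:44:00',
        '2021-06-05 12:44:00', '2021-08-06 12:44:00', '2021-09-07 12:44:00',
        '2021-03-12 12:45:00', '2021-03-13 12:46:00',
    ]),
})

df['ds'] = df.groupby('HostName')['Date'].transform(count_in_window)
print(df)
```

This needs only O(n) extra memory per group and O(n log n) time, at the cost of being a little less direct to read than the broadcast version.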