按时间计算并发用户会话 bins/intervals
Counting conccurent user sessions by time bins/intervals
我的数据由带有开始和结束时间戳的会话组成。我的任务是按公司和应用程序版本计算每个时间间隔 'active' 的会话数。我从 2 分钟的间隔开始。因此,如果一家公司的会话从 2:00pm 持续到 2:07pm... 该公司将计入 4 bins/intervals(2:00,2:02,2:04, 2:06).我将如何使用 pandas 来解决这个问题?
这是我的示例数据:
data = pd.DataFrame({
'Customer': ["CompanyA", "CompanyB", "CompanyA", "CompanyC", "CompanyB", "CompanyB", "CompanyC", "CompanyA", "CompanyC", "CompanyB"],
'AppVer': ["zi1","zi3","zi2","zi2","zi1","zi3","zi3","zi2","zi1","zi1"],
'start_timestamp': pd.date_range(start='2020-06-01 22:19:00', freq='3min', periods=10),
'end_timestamp': pd.date_range(start='2020-06-01 22:23:00', freq='4min', periods=10)
}).sample(frac=1)#randomize rows
print(data)
我能够根据 2 分钟的间隔进行计数
session_starts = (x - pd.Timedelta(seconds=x.second) for x in data['start_timestamp'])
session_ends = (x - pd.Timedelta(seconds=x.second) for x in data['end_timestamp'])
sesszip = zip(session_starts,session_ends)
ranges_gen = (pd.date_range(x.round("2T"),y.round("2T"),freq='2T') for x,y in sesszip)
ranges_ser = pd.Series(chain.from_iterable(ranges_gen),name="time")
output_df = ranges_ser.value_counts(sort=False)\
.sort_index()\
.to_frame()\
.reset_index()\
.rename(columns={'index':'time' , 'time':'counts'})
output_df
我现在不知道如何按客户或应用程序版本进行分组。将不胜感激对不同方法的任何帮助或想法。
好的,我们可以使用这个重新创建您的 output_df
:
data['timerange'] = data.apply(lambda x:
pd.date_range(x['start_timestamp'].floor('T').round('2T'),
x['end_timestamp'].floor('T').round('2T'),
freq='2T'),
axis=1)
data.explode('timerange').groupby(['timerange'])['Customer'].count()
输出:
timerange
2020-06-01 22:20:00 1
2020-06-01 22:22:00 2
2020-06-01 22:24:00 3
2020-06-01 22:26:00 2
2020-06-01 22:28:00 3
2020-06-01 22:30:00 2
2020-06-01 22:32:00 3
2020-06-01 22:34:00 3
2020-06-01 22:36:00 4
2020-06-01 22:38:00 3
2020-06-01 22:40:00 4
2020-06-01 22:42:00 3
2020-06-01 22:44:00 4
2020-06-01 22:46:00 4
2020-06-01 22:48:00 4
2020-06-01 22:50:00 3
2020-06-01 22:52:00 3
2020-06-01 22:54:00 2
2020-06-01 22:56:00 2
2020-06-01 22:58:00 1
2020-06-01 23:00:00 1
Name: Customer, dtype: int64
按客户:
data.explode('timerange').groupby(['Customer','timerange'])['Customer'].count().unstack(0, fill_value=0)
输出:
Customer CompanyA CompanyB CompanyC
timerange
2020-06-01 22:20:00 1 0 0
2020-06-01 22:22:00 1 1 0
2020-06-01 22:24:00 2 1 0
2020-06-01 22:26:00 1 1 0
2020-06-01 22:28:00 1 1 1
2020-06-01 22:30:00 1 0 1
2020-06-01 22:32:00 1 1 1
2020-06-01 22:34:00 0 2 1
2020-06-01 22:36:00 0 2 2
2020-06-01 22:38:00 0 2 1
2020-06-01 22:40:00 1 2 1
2020-06-01 22:42:00 1 1 1
2020-06-01 22:44:00 1 1 2
2020-06-01 22:46:00 1 1 2
2020-06-01 22:48:00 1 1 2
2020-06-01 22:50:00 1 1 1
2020-06-01 22:52:00 1 1 1
2020-06-01 22:54:00 0 1 1
2020-06-01 22:56:00 0 1 1
2020-06-01 22:58:00 0 1 0
2020-06-01 23:00:00 0 1 0
以及 AppVer:
data.explode('timerange').groupby(['AppVer','timerange'])['AppVer'].count().unstack(0, fill_value=0)
输出:
AppVer zi1 zi2 zi3
timerange
2020-06-01 22:20:00 1 0 0
2020-06-01 22:22:00 1 0 1
2020-06-01 22:24:00 1 1 1
2020-06-01 22:26:00 0 1 1
2020-06-01 22:28:00 0 2 1
2020-06-01 22:30:00 0 2 0
2020-06-01 22:32:00 1 2 0
2020-06-01 22:34:00 1 1 1
2020-06-01 22:36:00 1 1 2
2020-06-01 22:38:00 1 0 2
2020-06-01 22:40:00 1 1 2
2020-06-01 22:42:00 0 1 2
2020-06-01 22:44:00 1 1 2
2020-06-01 22:46:00 2 1 1
2020-06-01 22:48:00 2 1 1
2020-06-01 22:50:00 2 1 0
2020-06-01 22:52:00 2 1 0
2020-06-01 22:54:00 2 0 0
2020-06-01 22:56:00 2 0 0
2020-06-01 22:58:00 1 0 0
2020-06-01 23:00:00 1 0 0
我的数据由带有开始和结束时间戳的会话组成。我的任务是按公司和应用程序版本计算每个时间间隔 'active' 的会话数。我从 2 分钟的间隔开始。因此,如果一家公司的会话从 2:00pm 持续到 2:07pm... 该公司将计入 4 bins/intervals(2:00,2:02,2:04, 2:06).我将如何使用 pandas 来解决这个问题?
这是我的示例数据:
data = pd.DataFrame({
'Customer': ["CompanyA", "CompanyB", "CompanyA", "CompanyC", "CompanyB", "CompanyB", "CompanyC", "CompanyA", "CompanyC", "CompanyB"],
'AppVer': ["zi1","zi3","zi2","zi2","zi1","zi3","zi3","zi2","zi1","zi1"],
'start_timestamp': pd.date_range(start='2020-06-01 22:19:00', freq='3min', periods=10),
'end_timestamp': pd.date_range(start='2020-06-01 22:23:00', freq='4min', periods=10)
}).sample(frac=1)#randomize rows
print(data)
我能够根据 2 分钟的间隔进行计数
session_starts = (x - pd.Timedelta(seconds=x.second) for x in data['start_timestamp'])
session_ends = (x - pd.Timedelta(seconds=x.second) for x in data['end_timestamp'])
sesszip = zip(session_starts,session_ends)
ranges_gen = (pd.date_range(x.round("2T"),y.round("2T"),freq='2T') for x,y in sesszip)
ranges_ser = pd.Series(chain.from_iterable(ranges_gen),name="time")
output_df = ranges_ser.value_counts(sort=False)\
.sort_index()\
.to_frame()\
.reset_index()\
.rename(columns={'index':'time' , 'time':'counts'})
output_df
我现在不知道如何按客户或应用程序版本进行分组。将不胜感激对不同方法的任何帮助或想法。
好的,我们可以使用这个重新创建您的 output_df
:
data['timerange'] = data.apply(lambda x:
pd.date_range(x['start_timestamp'].floor('T').round('2T'),
x['end_timestamp'].floor('T').round('2T'),
freq='2T'),
axis=1)
data.explode('timerange').groupby(['timerange'])['Customer'].count()
输出:
timerange
2020-06-01 22:20:00 1
2020-06-01 22:22:00 2
2020-06-01 22:24:00 3
2020-06-01 22:26:00 2
2020-06-01 22:28:00 3
2020-06-01 22:30:00 2
2020-06-01 22:32:00 3
2020-06-01 22:34:00 3
2020-06-01 22:36:00 4
2020-06-01 22:38:00 3
2020-06-01 22:40:00 4
2020-06-01 22:42:00 3
2020-06-01 22:44:00 4
2020-06-01 22:46:00 4
2020-06-01 22:48:00 4
2020-06-01 22:50:00 3
2020-06-01 22:52:00 3
2020-06-01 22:54:00 2
2020-06-01 22:56:00 2
2020-06-01 22:58:00 1
2020-06-01 23:00:00 1
Name: Customer, dtype: int64
按客户:
data.explode('timerange').groupby(['Customer','timerange'])['Customer'].count().unstack(0, fill_value=0)
输出:
Customer CompanyA CompanyB CompanyC
timerange
2020-06-01 22:20:00 1 0 0
2020-06-01 22:22:00 1 1 0
2020-06-01 22:24:00 2 1 0
2020-06-01 22:26:00 1 1 0
2020-06-01 22:28:00 1 1 1
2020-06-01 22:30:00 1 0 1
2020-06-01 22:32:00 1 1 1
2020-06-01 22:34:00 0 2 1
2020-06-01 22:36:00 0 2 2
2020-06-01 22:38:00 0 2 1
2020-06-01 22:40:00 1 2 1
2020-06-01 22:42:00 1 1 1
2020-06-01 22:44:00 1 1 2
2020-06-01 22:46:00 1 1 2
2020-06-01 22:48:00 1 1 2
2020-06-01 22:50:00 1 1 1
2020-06-01 22:52:00 1 1 1
2020-06-01 22:54:00 0 1 1
2020-06-01 22:56:00 0 1 1
2020-06-01 22:58:00 0 1 0
2020-06-01 23:00:00 0 1 0
以及 AppVer:
data.explode('timerange').groupby(['AppVer','timerange'])['AppVer'].count().unstack(0, fill_value=0)
输出:
AppVer zi1 zi2 zi3
timerange
2020-06-01 22:20:00 1 0 0
2020-06-01 22:22:00 1 0 1
2020-06-01 22:24:00 1 1 1
2020-06-01 22:26:00 0 1 1
2020-06-01 22:28:00 0 2 1
2020-06-01 22:30:00 0 2 0
2020-06-01 22:32:00 1 2 0
2020-06-01 22:34:00 1 1 1
2020-06-01 22:36:00 1 1 2
2020-06-01 22:38:00 1 0 2
2020-06-01 22:40:00 1 1 2
2020-06-01 22:42:00 0 1 2
2020-06-01 22:44:00 1 1 2
2020-06-01 22:46:00 2 1 1
2020-06-01 22:48:00 2 1 1
2020-06-01 22:50:00 2 1 0
2020-06-01 22:52:00 2 1 0
2020-06-01 22:54:00 2 0 0
2020-06-01 22:56:00 2 0 0
2020-06-01 22:58:00 1 0 0
2020-06-01 23:00:00 1 0 0