pandas: sum rows based on multiple groups
I have this data frame:
df1
name triggerid description time
srvjboss03 30708 Access URL A failed 01:19:23
srvjboss03 30708 Access URL A failed 01:18:21
srvglass01 32942 Service Glassfish OFFLINE 00:35:00
srvglass01 32942 Service Glassfish OFFLINE 00:35:00
srvglass01 22725 Access URL B failed 00:36:04
srvglass01 22725 Access URL B failed 00:36:07
srvglass01 22725 Access URL B failed 00:06:04
srvglass01 22725 Access URL B failed 00:06:04
The expected output is:
name triggerid description time
srvjboss03 30708 Access URL A failed 02:37:44
srvglass01 32942 Service Glassfish OFFLINE 01:10:00
srvglass01 22725 Access URL B failed 01:24:19
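The time should be the sum of the times in all rows that share the same name, triggerid, and description.
I tried setting the name, triggerid, and description columns as the index and then grouping on it, but this is what I got: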
df1.set_index(['name', 'triggerid', 'description'], inplace=True)
df1.groupby(df1.index)['time'].sum()
name triggerid description time
srvjboss03 30708 Access URL A failed 01:19:23
Access URL A failed 01:18:21
srvglass01 32942 Service Glassfish OFFLINE 00:35:00
Service Glassfish OFFLINE 00:35:00
srvglass01 22725 Access URL B failed 00:36:04
Access URL B failed 00:36:07
Access URL B failed 00:06:04
Access URL B failed 00:06:04
The time column has dtype timedelta64.
Why doesn't pandas group description the same way it groups name and triggerid?
How can I get the desired output?
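Let's try this. First, convert the time column to timedelta.
import pandas as pd

# Convert the time column to timedelta so the values can be summed.
df['time'] = pd.to_timedelta(df['time'])
# Sum time within each (name, triggerid, description) group.
df.groupby(['name', 'triggerid', 'description'])['time'].sum().reset_index()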
Output:
name triggerid description time
0 srvglass01 22725 Access URL B failed 01:24:19
1 srvglass01 32942 Service Glassfish OFFLINE 01:10:00
2 srvjboss03 30708 Access URL A failed 02:37:44
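For anyone who wants to reproduce this end to end, here is a minimal sketch. It assumes the frame is built by hand from the sample rows in the question and that time starts out as plain "HH:MM:SS" strings; the variable names rows and result are only illustrative.

```python
import pandas as pd

# Sample rows copied from the question (time given as "HH:MM:SS" strings).
rows = [
    ("srvjboss03", 30708, "Access URL A failed", "01:19:23"),
    ("srvjboss03", 30708, "Access URL A failed", "01:18:21"),
    ("srvglass01", 32942, "Service Glassfish OFFLINE", "00:35:00"),
    ("srvglass01", 32942, "Service Glassfish OFFLINE", "00:35:00"),
    ("srvglass01", 22725, "Access URL B failed", "00:36:04"),
    ("srvglass01", 22725, "Access URL B failed", "00:36:07"),
    ("srvglass01", 22725, "Access URL B failed", "00:06:04"),
    ("srvglass01", 22725, "Access URL B failed", "00:06:04"),
]
df = pd.DataFrame(rows, columns=["name", "triggerid", "description", "time"])

# Parse the strings into timedelta values so that sum() adds durations.
df["time"] = pd.to_timedelta(df["time"])

# One row per (name, triggerid, description), with the summed duration.
result = (
    df.groupby(["name", "triggerid", "description"])["time"]
    .sum()
    .reset_index()
)
print(result)
```

Note that groupby sorts the group keys by default, which is why srvglass01 comes before srvjboss03 in the result.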
Other options:
df2 = df.set_index(['name','triggerid','description'])
df2.groupby(df2.index)['time'].sum()
Output:
(srvglass01, 22725, Access URL B failed) 01:24:19
(srvglass01, 32942, Service Glassfish OFFLINE) 01:10:00
(srvjboss03, 30708, Access URL A failed) 02:37:44
Name: time, dtype: timedelta64[ns]
Or:
df2.groupby(level=[0,1,2])['time'].sum()
Output:
name triggerid description
srvglass01 22725 Access URL B failed 01:24:19
32942 Service Glassfish OFFLINE 01:10:00
srvjboss03 30708 Access URL A failed 02:37:44
Name: time, dtype: timedelta64[ns]
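If the row order of the expected output matters (srvjboss03 first, as in the question), passing sort=False to groupby should keep the groups in order of first appearance. The following is a rough sketch under the same assumptions as the sample data in the question; keys, ordered, and by_level are illustrative names, not something from the original post.

```python
import pandas as pd

# Same sample data as in the question, built column by column.
df = pd.DataFrame(
    {
        "name": ["srvjboss03"] * 2 + ["srvglass01"] * 6,
        "triggerid": [30708] * 2 + [32942] * 2 + [22725] * 4,
        "description": ["Access URL A failed"] * 2
        + ["Service Glassfish OFFLINE"] * 2
        + ["Access URL B failed"] * 4,
        "time": pd.to_timedelta(
            [
                "01:19:23", "01:18:21",
                "00:35:00", "00:35:00",
                "00:36:04", "00:36:07", "00:06:04", "00:06:04",
            ]
        ),
    }
)
keys = ["name", "triggerid", "description"]

# sort=False keeps the groups in order of first appearance
# instead of sorting them by the key values.
ordered = df.groupby(keys, sort=False)["time"].sum().reset_index()

# The index-based variant gives the same totals when grouping by level names.
by_level = df.set_index(keys).groupby(level=keys, sort=False)["time"].sum()

print(ordered)
```

Both variants compute the same sums; which one to use mostly depends on whether you want the keys back as ordinary columns (ordered) or as a MultiIndex (by_level).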