pandas: sum rows based on multiple groups

I have this DataFrame:

df1

name         triggerid description                      time                            
srvjboss03   30708     Access URL A failed              01:19:23
srvjboss03   30708     Access URL A failed              01:18:21
srvglass01   32942     Service Glassfish OFFLINE        00:35:00
srvglass01   32942     Service Glassfish OFFLINE        00:35:00
srvglass01   22725     Access URL B failed              00:36:04
srvglass01   22725     Access URL B failed              00:36:07
srvglass01   22725     Access URL B failed              00:06:04
srvglass01   22725     Access URL B failed              00:06:04

The expected output is:

name         triggerid description                      time                            
srvjboss03   30708     Access URL A failed              02:37:44
srvglass01   32942     Service Glassfish OFFLINE        01:10:00
srvglass01   22725     Access URL B failed              01:24:19

Here time is the sum of the time values across all rows that share the same name, triggerid, and description.

I tried setting the name, triggerid, and description columns as the index and then grouping on that index, but this is what I got:

df1.set_index(['name', 'triggerid', 'description'], inplace=True)

df1.groupby(df1.index)['time'].sum()


name         triggerid description                      time
srvjboss03   30708     Access URL A failed              01:19:23
                       Access URL A failed              01:18:21
srvglass01   32942     Service Glassfish OFFLINE        00:35:00
                       Service Glassfish OFFLINE        00:35:00
srvglass01   22725     Access URL B failed              00:36:04
                       Access URL B failed              00:36:07
                       Access URL B failed              00:06:04
                       Access URL B failed              00:06:04

The time column has dtype timedelta64. Why doesn't pandas group on description the same way it does on name and triggerid? How can I get the desired output?

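Let's try this, starting from the frame with name, triggerid, and description still as regular columns (called df here). First, convert the time column to timedelta.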

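# Make sure time is a timedelta column so the values can be summed.
df['time'] = pd.to_timedelta(df['time'])

# Group on all three key columns, sum the durations, then flatten the index.
df.groupby(['name','triggerid','description'])['time'].sum()\
  .reset_index()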

Output:

         name  triggerid                description     time
0  srvglass01      22725        Access URL B failed 01:24:19
1  srvglass01      32942  Service Glassfish OFFLINE 01:10:00
2  srvjboss03      30708        Access URL A failed 02:37:44
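
For anyone who wants to reproduce this end to end, here is a minimal sketch that rebuilds the sample rows from the question and runs the same groupby. The names `rows` and `summed` are just illustrative, and passing `sort=False` is my own addition: by default groupby sorts the group keys, while `sort=False` keeps the groups in order of first appearance, which matches the row order in the question's desired output.

```python
import pandas as pd

# Sample rows copied from the question; `rows` and `summed` are illustrative names.
rows = [
    ("srvjboss03", 30708, "Access URL A failed", "01:19:23"),
    ("srvjboss03", 30708, "Access URL A failed", "01:18:21"),
    ("srvglass01", 32942, "Service Glassfish OFFLINE", "00:35:00"),
    ("srvglass01", 32942, "Service Glassfish OFFLINE", "00:35:00"),
    ("srvglass01", 22725, "Access URL B failed", "00:36:04"),
    ("srvglass01", 22725, "Access URL B failed", "00:36:07"),
    ("srvglass01", 22725, "Access URL B failed", "00:06:04"),
    ("srvglass01", 22725, "Access URL B failed", "00:06:04"),
]
df = pd.DataFrame(rows, columns=["name", "triggerid", "description", "time"])

# Parse the "HH:MM:SS" strings into timedeltas so they can be summed.
df["time"] = pd.to_timedelta(df["time"])

# sort=False keeps groups in order of first appearance instead of sorting the keys.
summed = (
    df.groupby(["name", "triggerid", "description"], sort=False)["time"]
      .sum()
      .reset_index()
)
print(summed)
```

How the time column is printed (with or without a leading "0 days") can differ between pandas versions, so the display may not match the output above character for character.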

Other options:

df2 = df.set_index(['name','triggerid','description'])
df2.groupby(df2.index)['time'].sum()

Output:

(srvglass01, 22725, Access URL B failed)         01:24:19
(srvglass01, 32942, Service Glassfish OFFLINE)   01:10:00
(srvjboss03, 30708, Access URL A failed)         02:37:44
Name: time, dtype: timedelta64[ns]

df2.groupby(level=[0,1,2])['time'].sum()

Output:

name        triggerid  description              
srvglass01  22725      Access URL B failed         01:24:19
            32942      Service Glassfish OFFLINE   01:10:00
srvjboss03  30708      Access URL A failed         02:37:44
Name: time, dtype: timedelta64[ns]
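
If you want a flat table with the totals shown as HH:MM:SS text, as in the question's desired output, something along these lines may help. It is only a sketch: `format_hhmmss` is a hypothetical helper I wrote for this example (not a pandas function), and the sample frame is rebuilt from the question's rows so the snippet runs on its own. It groups by index level names rather than positions, which I find easier to read than `level=[0,1,2]`.

```python
import pandas as pd


def format_hhmmss(td: pd.Timedelta) -> str:
    """Hypothetical helper: render a Timedelta as HH:MM:SS (hours not wrapped at 24)."""
    total_seconds = int(td.total_seconds())
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


# Sample data rebuilt from the question.
df = pd.DataFrame({
    "name": ["srvjboss03"] * 2 + ["srvglass01"] * 6,
    "triggerid": [30708] * 2 + [32942] * 2 + [22725] * 4,
    "description": ["Access URL A failed"] * 2
                   + ["Service Glassfish OFFLINE"] * 2
                   + ["Access URL B failed"] * 4,
    "time": pd.to_timedelta([
        "01:19:23", "01:18:21", "00:35:00", "00:35:00",
        "00:36:04", "00:36:07", "00:06:04", "00:06:04",
    ]),
})
df2 = df.set_index(["name", "triggerid", "description"])

# Group by the named index levels, sum, and turn the index back into columns.
out = (
    df2.groupby(level=["name", "triggerid", "description"])["time"]
       .sum()
       .reset_index()
)

# Replace the timedelta totals with HH:MM:SS strings for display.
out["time"] = [format_hhmmss(td) for td in out["time"]]
print(out)
```

Note that after the last step the time column holds strings, so do any further arithmetic on the timedelta values before formatting.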