Is there a way in pandas to groupby and then count unique where another column has a specified value?

I have a pandas DataFrame with many columns. For simplicity, assume the columns are 'country', 'time_bucket', 'category' and 'id'. 'category' can be either 'staff' or 'student'.

import pandas as pd

data = {'country': ['A', 'A', 'A', 'B', 'B'],
        'time_bucket': ['8', '8', '8', '8', '9'],
        'category': ['staff', 'staff', 'student', 'student', 'staff'],
        'id': ['101', '172', '122', '142', '132'],
        }

df = pd.DataFrame(data, columns=['country', 'time_bucket', 'category', 'id'])
df

  country time_bucket category   id
0       A           8    staff  101
1       A           8    staff  172
2       A           8  student  122
3       B           8  student  142
4       B           9    staff  132

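I want to find the total number of staff and the total number of students in a country during a particular time interval, and add them as new columns.

I can get the total number of persons in a country during a particular time interval: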

df['persons_count'] = df.groupby(['time_bucket','country'])['id'].transform('nunique')

  country time_bucket category   id  persons_count
0       A           8    staff  101              3
1       A           8    staff  172              3
2       A           8  student  122              3
3       B           8  student  142              1
4       B           9    staff  132              1

However, I can't figure out how to take 'category' into account and add it to my code.

I want something like this:

  country time_bucket category   id  staff_count  student_count
0       A           8    staff  101            2              1
1       A           8    staff  172            2              1
2       A           8  student  122            2              1
3       B           8  student  142            0              1
4       B           9    staff  132            1              0

Any suggestions would be appreciated!


Adding a new example that shows the need for a unique 'id' count:

import pandas as pd

data = {'country': ['A', 'A', 'A', 'A', 'B', 'B'],
        'time_bucket': ['8', '8', '8', '8', '8', '9'],
        'category': ['staff', 'staff', 'student', 'student', 'student', 'staff'],
        'id': ['101', '172', '122', '122', '142', '132'],
        }

df = pd.DataFrame(data, columns=['country', 'time_bucket', 'category', 'id'])
df

  country time_bucket category   id
0       A           8    staff  101
1       A           8    staff  172
2       A           8  student  122
3       A           8  student  122
4       B           8  student  142
5       B           9    staff  132

I want something like this:

  country time_bucket category   id  staff_count  student_count
0       A           8    staff  101            2              1
1       A           8    staff  172            2              1
2       A           8  student  122            2              1
3       A           8  student  122            2              1
4       B           8  student  142            0              1
5       B           9    staff  132            1              0

import pandas as pd

data = {'country': ['A', 'A', 'A', 'B', 'B'],
        'time_bucket': ['8', '8', '8', '8', '9'],
        'category': ['staff', 'staff', 'student', 'student', 'staff'],
        'id': ['101', '172', '122', '142', '132'],
        }

df = pd.DataFrame(data, columns=['country', 'time_bucket', 'category', 'id'])

# Count unique ids per (time_bucket, country, category) group
df['persons_count'] = df.groupby(['time_bucket', 'country', 'category'])['id'].transform('nunique')

# Spread the categories into columns; combinations with no rows become 0
df = df.pivot_table(index=['country', 'time_bucket', 'id'], columns='category', values='persons_count').fillna(0)

Output:

category                staff  student
country time_bucket id
A       8           101    2.0      0.0
                    122    0.0      1.0
                    172    2.0      0.0
B       8           142    0.0      1.0
        9           132    1.0      0.0
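
The pivoted result above reports each id only under its own category and drops the original row layout. As a hedged sketch of one way to get the `staff_count` and `student_count` columns on every original row, the ids of one category can be masked with `where` and then counted with `transform('nunique')`, which ignores the masked (NaN) entries. The names `keys` and `ids_of_cat` are only illustrative, and the data is the second example from the question.

```python
import pandas as pd

# Second example from the question: id '122' appears twice for country A.
data = {'country': ['A', 'A', 'A', 'A', 'B', 'B'],
        'time_bucket': ['8', '8', '8', '8', '8', '9'],
        'category': ['staff', 'staff', 'student', 'student', 'student', 'staff'],
        'id': ['101', '172', '122', '122', '142', '132'],
        }
df = pd.DataFrame(data)

# Group by the same keys for every category.
keys = [df['country'], df['time_bucket']]

for cat in ['staff', 'student']:
    # Keep the id only on rows of this category; other rows become NaN,
    # which nunique skips by default.
    ids_of_cat = df['id'].where(df['category'] == cat)
    # transform returns one value per original row, so it can be assigned directly.
    df[cat + '_count'] = ids_of_cat.groupby(keys).transform('nunique')

print(df)
```

Because the mask is applied before grouping, a group with no rows of a given category gets 0 rather than a missing value.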

  

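We can combine the groupby operation with apply. apply takes a function as its argument, and that function receives a sub-DataFrame for each group. With the data you provided, grouped by [country, time_bucket], it receives 3 rows for [A, 8], 1 row for [B, 8] and 1 row for [B, 9].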
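
To make the grouping concrete, here is a small sketch that iterates over the groups and records how many rows each one holds. It only inspects the groups and is not part of the solution itself; `group_sizes` is an illustrative name.

```python
import pandas as pd

data = {'country': ['A', 'A', 'A', 'B', 'B'],
        'time_bucket': ['8', '8', '8', '8', '9'],
        'category': ['staff', 'staff', 'student', 'student', 'staff'],
        'id': ['101', '172', '122', '142', '132'],
        }
df = pd.DataFrame(data)

group_sizes = {}
for key, group in df.groupby(['country', 'time_bucket']):
    # key is a (country, time_bucket) tuple; group is the matching sub-DataFrame,
    # which is what a function passed to apply would receive.
    group_sizes[key] = len(group)
    print(key, len(group))
```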

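To get the output you requested (note that this counts rows per category, so an id that appears twice in a group is counted twice):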

import pandas as pd
from collections import Counter

data = {'country':  ['A', 'A', 'A', 'B', 'B'],
        'time_bucket': ['8', '8', '8', '8', '9'],
        'category': ['staff', 'staff', 'student', 'student', 'staff'],
        'id': ['101', '172', '122', '142', '132'],
        }

df = pd.DataFrame(data, columns=['country', 'time_bucket', 'category', 'id'])


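def category_counter(group):
    # group is the sub-DataFrame for one (country, time_bucket) pair
    counter = Counter(group['category'].tolist())
    group = group.copy()  # avoid modifying the slice passed in by groupby
    for k in ['staff', 'student']:
        group[k + '_count'] = counter[k]
    return group


# group_keys=False keeps the original row index in the result
df.groupby(['country', 'time_bucket'], group_keys=False).apply(category_counter)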

Output:

  country time_bucket category   id  staff_count  student_count
0       A           8    staff  101            2              1
1       A           8    staff  172            2              1
2       A           8  student  122            2              1
3       B           8  student  142            0              1
4       B           9    staff  132            1              0
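
The question's update asks for unique ids, and `Counter` counts rows. A possible variant, sketched here under the assumption that only the 'category' and 'id' columns need to reach the function, replaces the `Counter` with `nunique` on the ids of each category and joins the result back onto the original rows. `unique_category_counter`, `counts` and `result` are illustrative names, and the data is the second example from the question.

```python
import pandas as pd

# Second example from the question: id '122' appears twice for country A.
data = {'country': ['A', 'A', 'A', 'A', 'B', 'B'],
        'time_bucket': ['8', '8', '8', '8', '8', '9'],
        'category': ['staff', 'staff', 'student', 'student', 'student', 'staff'],
        'id': ['101', '172', '122', '122', '142', '132'],
        }
df = pd.DataFrame(data)


def unique_category_counter(group):
    # group holds the 'category' and 'id' columns for one (country, time_bucket) pair.
    out = pd.DataFrame(index=group.index)
    for k in ['staff', 'student']:
        # Number of distinct ids among the rows of category k (0 if there are none).
        out[k + '_count'] = group.loc[group['category'] == k, 'id'].nunique()
    return out


# Selecting the two needed columns keeps the grouping columns out of the function;
# group_keys=False keeps the original row index so the result can be joined back.
counts = (
    df.groupby(['country', 'time_bucket'], group_keys=False)[['category', 'id']]
      .apply(unique_category_counter)
)
result = df.join(counts)
print(result)
```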

An alternative that does not return duplicated data:

import pandas as pd
from collections import Counter

data = {'country':  ['A', 'A', 'A', 'B', 'B'],
        'time_bucket': ['8', '8', '8', '8', '9'],
        'category': ['staff', 'staff', 'student', 'student', 'staff'],
        'id': ['101', '172', '122', '142', '132'],
        }

df = pd.DataFrame(data, columns=['country', 'time_bucket', 'category', 'id'])


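def category_counter(group):
    # group is the sub-DataFrame for one (country, time_bucket) pair
    counter = Counter(group['category'].tolist())
    return_data = {}
    for k in ['staff', 'student']:
        return_data[k + '_count'] = counter[k]

    # One Series per group gives one output row per (country, time_bucket)
    return pd.Series(return_data)


df.groupby(['country', 'time_bucket']).apply(category_counter)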

Output:

                     staff_count  student_count
country time_bucket
A       8                      2              1
B       8                      0              1
        9                      1              0
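
If the compact table is preferred for the computation but the counts are still wanted on every original row, one option is to build the per-group table with `nunique` and `unstack`, then join it back on the grouping keys. This is a sketch rather than part of the answer above; `counts` and `result` are illustrative names, and it counts unique ids instead of rows, using the second example from the question.

```python
import pandas as pd

# Second example from the question: id '122' appears twice for country A.
data = {'country': ['A', 'A', 'A', 'A', 'B', 'B'],
        'time_bucket': ['8', '8', '8', '8', '8', '9'],
        'category': ['staff', 'staff', 'student', 'student', 'student', 'staff'],
        'id': ['101', '172', '122', '122', '142', '132'],
        }
df = pd.DataFrame(data)

# One row per (country, time_bucket), one column per category.
counts = (
    df.groupby(['country', 'time_bucket', 'category'])['id']
      .nunique()                           # unique ids per group and category
      .unstack('category', fill_value=0)   # categories become columns, gaps become 0
      .add_suffix('_count')                # staff -> staff_count, student -> student_count
)

# Attach the per-group counts to every original row via the grouping keys.
result = df.join(counts, on=['country', 'time_bucket'])
print(result)
```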