Is there a way in pandas to groupby and then count unique where another column has a specified value?
I have a pandas DataFrame with many columns. For simplicity, assume the columns are 'country', 'time_bucket', 'category', and 'id'. 'category' can be either 'staff' or 'student'.
import pandas as pd

data = {
    'country': ['A', 'A', 'A', 'B', 'B'],
    'time_bucket': ['8', '8', '8', '8', '9'],
    'category': ['staff', 'staff', 'student', 'student', 'staff'],
    'id': ['101', '172', '122', '142', '132'],
}
df = pd.DataFrame(data, columns=['country', 'time_bucket', 'category', 'id'])
df

  country time_bucket category  id
0       A           8    staff 101
1       A           8    staff 172
2       A           8  student 122
3       B           8  student 142
4       B           9    staff 132
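I want to find the total number of staff and the total number of students in a country for a given time bucket, and add them as new columns.
I can get the total number of persons in a country for a given time bucket: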
df['persons_count'] = df.groupby(['time_bucket','country'])['id'].transform('nunique')
  country time_bucket category  id  persons_count
0       A           8    staff 101              3
1       A           8    staff 172              3
2       A           8  student 122              3
3       B           8  student 142              1
4       B           9    staff 132              1
However, I can't work out how to take the category ('staff' or 'student') into account in my code.
I'd like something like this:
  country time_bucket category  id  staff_count  student_count
0       A           8    staff 101            2              1
1       A           8    staff 172            2              1
2       A           8  student 122            2              1
3       B           8  student 142            0              1
4       B           9    staff 132            1              0
Any suggestions would be appreciated!
Adding a new example to show that a count of unique 'id' values is needed:
import pandas as pd

data = {
    'country': ['A', 'A', 'A', 'A', 'B', 'B'],
    'time_bucket': ['8', '8', '8', '8', '8', '9'],
    'category': ['staff', 'staff', 'student', 'student', 'student', 'staff'],
    'id': ['101', '172', '122', '122', '142', '132'],
}
df = pd.DataFrame(data, columns=['country', 'time_bucket', 'category', 'id'])
df

  country time_bucket category  id
0       A           8    staff 101
1       A           8    staff 172
2       A           8  student 122
3       A           8  student 122
4       B           8  student 142
5       B           9    staff 132
I'd like something like this:
  country time_bucket category  id  staff_count  student_count
0       A           8    staff 101            2              1
1       A           8    staff 172            2              1
2       A           8  student 122            2              1
3       A           8  student 122            2              1
4       B           8  student 142            0              1
5       B           9    staff 132            1              0
import pandas as pd

data = {
    'country': ['A', 'A', 'A', 'B', 'B'],
    'time_bucket': ['8', '8', '8', '8', '9'],
    'category': ['staff', 'staff', 'student', 'student', 'staff'],
    'id': ['101', '172', '122', '142', '132'],
}
df = pd.DataFrame(data, columns=['country', 'time_bucket', 'category', 'id'])

# Count unique ids per (time_bucket, country, category) and attach the count to each row
df['persons_count'] = df.groupby(['time_bucket', 'country', 'category'])['id'].transform('nunique')

# Spread the counts into one column per category; missing combinations become 0
df = df.pivot_table(index=['country', 'time_bucket', 'id'], columns='category', values='persons_count').fillna(0)
Output:
category                 staff  student
country time_bucket id
A       8           101    2.0      0.0
                    122    0.0      1.0
                    172    2.0      0.0
B       8           142    0.0      1.0
        9           132    1.0      0.0
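The pivoted table has one row per id rather than the row-level columns asked for in the question. As a minimal sketch (not the answerer's code), the per-category unique counts could instead be computed with groupby(...).nunique(), unstacked into one column per category, and merged back onto the original rows. The names keys, counts and result are introduced here for illustration, and the data is the question's second example.

```python
import pandas as pd

data = {
    'country': ['A', 'A', 'A', 'A', 'B', 'B'],
    'time_bucket': ['8', '8', '8', '8', '8', '9'],
    'category': ['staff', 'staff', 'student', 'student', 'student', 'staff'],
    'id': ['101', '172', '122', '122', '142', '132'],
}
df = pd.DataFrame(data)

keys = ['country', 'time_bucket']

# Unique ids per (country, time_bucket, category), one column per category.
# Combinations that never occur are filled with 0.
counts = (
    df.groupby(keys + ['category'])['id']
      .nunique()
      .unstack('category', fill_value=0)
      .add_suffix('_count')
      .reset_index()
)
counts.columns.name = None  # drop the leftover 'category' axis label

# Broadcast the per-group counts back onto every original row.
result = df.merge(counts, on=keys, how='left')
print(result)
```

A left merge keeps the original row order, so the new columns line up with the rows of df.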
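We can combine the groupby operation with apply. apply takes a function as an argument, and that function receives a sub-DataFrame for each group. With the data you provided, grouping by [country, time_bucket], it receives 3 rows for [A, 8], 1 row for [B, 8], and 1 row for [B, 9].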
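To see what apply hands to the function, it can help to iterate over the groups directly. This is a small sketch using the question's first example; group_sizes is a name introduced here for illustration.

```python
import pandas as pd

data = {
    'country': ['A', 'A', 'A', 'B', 'B'],
    'time_bucket': ['8', '8', '8', '8', '9'],
    'category': ['staff', 'staff', 'student', 'student', 'staff'],
    'id': ['101', '172', '122', '142', '132'],
}
df = pd.DataFrame(data)

# Each iteration yields the group key (a tuple) and that group's sub-DataFrame.
group_sizes = {}
for key, group in df.groupby(['country', 'time_bucket']):
    group_sizes[key] = len(group)
    print(key, len(group))
```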
To get the output you requested:
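import pandas as pd
from collections import Counter

data = {
    'country': ['A', 'A', 'A', 'B', 'B'],
    'time_bucket': ['8', '8', '8', '8', '9'],
    'category': ['staff', 'staff', 'student', 'student', 'staff'],
    'id': ['101', '172', '122', '142', '132'],
}
df = pd.DataFrame(data, columns=['country', 'time_bucket', 'category', 'id'])

def category_counter(row):
    # `row` is the sub-DataFrame for one (country, time_bucket) group
    row = row.copy()  # avoid modifying a view of the original frame
    counter = Counter(row.category.tolist())
    for k in ['staff', 'student']:
        row[k+'_count'] = counter[k]  # Counter returns 0 for a missing category
    return row

# group_keys=False keeps the original row index, as in the output below;
# newer pandas versions otherwise prepend the group keys to the index
df.groupby(['country', 'time_bucket'], group_keys=False).apply(category_counter)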
Output:
  country time_bucket category  id  staff_count  student_count
0       A           8    staff 101            2              1
1       A           8    staff 172            2              1
2       A           8  student 122            2              1
3       B           8  student 142            0              1
4       B           9    staff 132            1              0
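One caveat: Counter counts rows, not unique ids. On the question's second example, where id '122' appears twice in group [A, 8], the two notions differ. The quick check below assumes unique ids are what is wanted; row_count and unique_count are illustrative names.

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({
    'country': ['A', 'A', 'A', 'A', 'B', 'B'],
    'time_bucket': ['8', '8', '8', '8', '8', '9'],
    'category': ['staff', 'staff', 'student', 'student', 'student', 'staff'],
    'id': ['101', '172', '122', '122', '142', '132'],
})

# Look at the single group (country 'A', time_bucket '8').
group = df[(df['country'] == 'A') & (df['time_bucket'] == '8')]

# Row-based count, as the Counter approach computes it.
row_count = Counter(group['category'])['student']

# Count of distinct ids among the student rows.
unique_count = group.loc[group['category'] == 'student', 'id'].nunique()

print(row_count, unique_count)  # 2 1
```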
An alternative that returns one row per group instead of repeating the counts on every row:
import pandas as pd
from collections import Counter

data = {
    'country': ['A', 'A', 'A', 'B', 'B'],
    'time_bucket': ['8', '8', '8', '8', '9'],
    'category': ['staff', 'staff', 'student', 'student', 'staff'],
    'id': ['101', '172', '122', '142', '132'],
}
df = pd.DataFrame(data, columns=['country', 'time_bucket', 'category', 'id'])

def category_counter(row):
    # `row` is the sub-DataFrame for one (country, time_bucket) group
    counter = Counter(row.category.tolist())
    return_data = {}
    for k in ['staff', 'student']:
        return_data[k+'_count'] = counter[k]
    # Returning a Series gives one summary row per group
    return pd.Series(return_data)

df.groupby(['country', 'time_bucket']).apply(category_counter)
Output:
                     staff_count  student_count
country time_bucket
A       8                      2              1
B       8                      0              1
        9                      1              0
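If unique ids per category are required, as the question's second example suggests, one option that avoids apply is to blank out the ids of the other categories with Series.where and then call transform('nunique'), which skips NaN by default. This is a sketch under that assumption, not part of the answers above; keys and ids_of_cat are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'country': ['A', 'A', 'A', 'A', 'B', 'B'],
    'time_bucket': ['8', '8', '8', '8', '8', '9'],
    'category': ['staff', 'staff', 'student', 'student', 'student', 'staff'],
    'id': ['101', '172', '122', '122', '142', '132'],
})

# Group by the same keys as in the question.
keys = [df['country'], df['time_bucket']]

for cat in ['staff', 'student']:
    # Keep the id only on rows of this category; other rows become NaN,
    # which nunique ignores by default.
    ids_of_cat = df['id'].where(df['category'].eq(cat))
    # transform broadcasts the per-group unique count back to every row.
    df[cat + '_count'] = ids_of_cat.groupby(keys).transform('nunique')

print(df)
```

Because transform returns a result aligned to the original index, the new columns can be assigned directly without a merge.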