Grouping categorical data over other categorical data with python/pandas

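I have a pandas DataFrame in which one column stores the name of a specific task and another column reports the ID of the employee who performed that task. Something like: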

EMPLOYEE_ID    TASK_NAME
Employee1      Inspection
Employee2      Inspection
Employee3      Inspection
Employee4      Inspection
Employee5      Inspection
Employee1      Change
Employee2      Inspection
Employee3      Change
Employee1      Change
Employee2      Change

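I would like to know what kind of commands/analyses I need in order to group/cluster the employees by the tasks they performed. In other words, I am after a statement such as: "Employee_Group_1" (comprising Employee1, Employee2 and Employee3) performed 75% of all Inspection and Change tasks.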

Any help would be appreciated! Thanks in advance.

I think you need Series.map with a flattened dictionary (called d1 here), followed by Series.value_counts:

# group label -> list of employees in that group
d = {'g1':['Employee1', 'Employee2', 'Employee3'],
     'g2':['Employee4', 'Employee5', 'Employee6']}

# flatten to an employee -> group label lookup
d1 = {k: oldk for oldk, oldv in d.items() for k in oldv}
print(d1)
{'Employee1': 'g1', 'Employee2': 'g1', 'Employee3': 'g1', 
 'Employee4': 'g2', 'Employee5': 'g2', 'Employee6': 'g2'}

# share of all rows (tasks) handled by each group
s = df['EMPLOYEE_ID'].map(d1).value_counts(normalize=True)
print(s)
g1    0.8
g2    0.2
Name: EMPLOYEE_ID, dtype: float64
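
To try this end to end, here is a minimal self-contained sketch. It assumes the frame looks exactly like the sample in the question (the question does not show how df is built, so the constructor call is my reconstruction) and reuses the group definitions from d above. The variable name share is only for illustration.

```python
import pandas as pd

# Sample data retyped from the question; the real frame may be built differently.
df = pd.DataFrame({
    'EMPLOYEE_ID': ['Employee1', 'Employee2', 'Employee3', 'Employee4', 'Employee5',
                    'Employee1', 'Employee2', 'Employee3', 'Employee1', 'Employee2'],
    'TASK_NAME': ['Inspection', 'Inspection', 'Inspection', 'Inspection', 'Inspection',
                  'Change', 'Inspection', 'Change', 'Change', 'Change'],
})

# Group label -> members, then inverted to member -> group label.
d = {'g1': ['Employee1', 'Employee2', 'Employee3'],
     'g2': ['Employee4', 'Employee5', 'Employee6']}
d1 = {emp: grp for grp, emps in d.items() for emp in emps}

# Fraction of all rows (tasks) that belongs to each group.
share = df['EMPLOYEE_ID'].map(d1).value_counts(normalize=True)
print(share)
```

With the ten sample rows, g1 accounts for eight of them and g2 for two, which matches the 0.8 / 0.2 split shown above.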

If you also want to analyze another column, use SeriesGroupBy.value_counts:

# per-group share of each task (the shares within one group sum to 1)
df2 = (df.groupby(df['EMPLOYEE_ID'].map(d1))['TASK_NAME']
         .value_counts(normalize=True)
         .reset_index(name='norm'))
print(df2)
  EMPLOYEE_ID   TASK_NAME  norm
0          g1      Change   0.5
1          g1  Inspection   0.5
2          g2  Inspection   1.0
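
Note that the norm column above is normalized within each group. If the question is instead "which share of each task was done by each group", one possible alternative (my suggestion, not part of the approach above) is pandas.crosstab with normalize='columns'. The sketch below assumes the sample data from the question; the names groups, task_share and the 'GROUP' label are only illustrative.

```python
import pandas as pd

# Sample data retyped from the question.
df = pd.DataFrame({
    'EMPLOYEE_ID': ['Employee1', 'Employee2', 'Employee3', 'Employee4', 'Employee5',
                    'Employee1', 'Employee2', 'Employee3', 'Employee1', 'Employee2'],
    'TASK_NAME': ['Inspection', 'Inspection', 'Inspection', 'Inspection', 'Inspection',
                  'Change', 'Inspection', 'Change', 'Change', 'Change'],
})

# Employee -> group label lookup (same content as d1 for the employees present).
d1 = {'Employee1': 'g1', 'Employee2': 'g1', 'Employee3': 'g1',
      'Employee4': 'g2', 'Employee5': 'g2', 'Employee6': 'g2'}

groups = df['EMPLOYEE_ID'].map(d1).rename('GROUP')

# normalize='columns' makes every task column sum to 1,
# so each cell is the group's share of that task.
task_share = pd.crosstab(groups, df['TASK_NAME'], normalize='columns')
print(task_share)
```

On the sample data this says g1 did all of the Change tasks and two thirds of the Inspection tasks. Passing normalize='all' instead would express every cell as a fraction of all rows.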

Details:

print(df['EMPLOYEE_ID'].map(d1))
0    g1
1    g1
2    g1
3    g2
4    g2
5    g1
6    g1
7    g1
8    g1
9    g1
Name: EMPLOYEE_ID, dtype: object
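
One thing to watch for: an ID that is missing from d1 is mapped to NaN, and value_counts skips NaN by default, so such rows silently drop out of the percentages. A small sketch of one way to keep them visible, assuming a hypothetical Employee7 that belongs to no group and an illustrative 'unassigned' label:

```python
import pandas as pd

# Employee7 is hypothetical: it does not appear in the question or in d1.
ids = pd.Series(['Employee1', 'Employee4', 'Employee7'], name='EMPLOYEE_ID')
d1 = {'Employee1': 'g1', 'Employee4': 'g2'}

mapped = ids.map(d1)                      # unmapped IDs become NaN
labelled = mapped.fillna('unassigned')    # give them an explicit label instead

counts = labelled.value_counts(normalize=True)
print(counts)
```

Alternatively, value_counts(normalize=True, dropna=False) keeps the NaN bucket without relabelling it.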