Grouping categorical data over other categorical data with python/pandas
I have a pandas DataFrame where one column stores the name of a specific task and another column reports the ID of the employee who performed that task. Something like:
EMPLOYEE_ID TASK_NAME
Employee1 Inspection
Employee2 Inspection
Employee3 Inspection
Employee4 Inspection
Employee5 Inspection
Employee1 Change
Employee2 Inspection
Employee3 Change
Employee1 Change
Employee2 Change
I would like to know what kind of command/analysis I need to group/cluster employees by the tasks they complete. In other words, for example, "Employee_Group_1" (made up of Employee1, Employee2 and Employee3) performed 75% of all Inspection and Change tasks.
Any help would be greatly appreciated! Thanks in advance.
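For reproducibility, here is a minimal sketch (not part of the original question) that rebuilds the sample table above as a DataFrame, using the column names shown in the question, so the snippets below can be run as-is:

import pandas as pd

# sample data copied from the table in the question
df = pd.DataFrame({
    'EMPLOYEE_ID': ['Employee1', 'Employee2', 'Employee3', 'Employee4', 'Employee5',
                    'Employee1', 'Employee2', 'Employee3', 'Employee1', 'Employee2'],
    'TASK_NAME':   ['Inspection', 'Inspection', 'Inspection', 'Inspection', 'Inspection',
                    'Change', 'Inspection', 'Change', 'Change', 'Change'],
})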
I think you need to map the employees to groups with a flattened dictionary (called d1 here) and then use Series.value_counts:
d = {'g1':['Employee1', 'Employee2', 'Employee3'],
'g2':['Employee4', 'Employee5', 'Employee6']}
# invert the group -> members dict into a member -> group lookup
d1 = {k: oldk for oldk, oldv in d.items() for k in oldv}
print (d1)
{'Employee1': 'g1', 'Employee2': 'g1', 'Employee3': 'g1',
'Employee4': 'g2', 'Employee5': 'g2', 'Employee6': 'g2'}
# map each row's employee to its group, then get each group's share of all rows
s = df['EMPLOYEE_ID'].map(d1).value_counts(normalize=True)
print (s)
g1 0.8
g2 0.2
Name: EMPLOYEE_ID, dtype: float64
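Note that any EMPLOYEE_ID missing from d1 is mapped to NaN and silently dropped by value_counts. If that matters, one option (a sketch, using a hypothetical 'unassigned' label) is:

# keep employees that are not in d1 instead of dropping them
groups = df['EMPLOYEE_ID'].map(d1).fillna('unassigned')
s = groups.value_counts(normalize=True)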
If you also want to analyze another column, use SeriesGroupBy.value_counts:
# group rows by the mapped group label, then get each group's task distribution
df2 = (df.groupby(df['EMPLOYEE_ID'].map(d1))['TASK_NAME']
         .value_counts(normalize=True)
         .reset_index(name='norm'))
print (df2)
EMPLOYEE_ID TASK_NAME norm
0 g1 Change 0.5
1 g1 Inspection 0.5
2 g2 Inspection 1.0
Details:
print (df['EMPLOYEE_ID'].map(d1))
0 g1
1 g1
2 g1
3 g2
4 g2
5 g1
6 g1
7 g1
8 g1
9 g1
Name: EMPLOYEE_ID, dtype: object
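The question also phrases the goal as the share of each task type done by a group (e.g. "g1 performed 75% of all Inspection and Change tasks"). One way to get that view, as a sketch building on the same d1 mapping (not part of the original answer), is a row-normalized crosstab:

# for each task, the fraction performed by each group (rows sum to 1)
task_share = pd.crosstab(df['TASK_NAME'],
                         df['EMPLOYEE_ID'].map(d1),
                         normalize='index')
print(task_share)
# for the sample data: Change is done entirely by g1,
# Inspection is split roughly 0.67 (g1) / 0.33 (g2)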