Groupby and count the number of unique values (Pandas)

Question

I have a data frame with two variables: ID and outcome. I'm trying to group by ID first, and then count how many times each unique value of outcome occurs within each ID.

df
ID    outcome
1      yes
1      yes
1      yes
2      no
2      yes
2      no

Expected output:

ID    yes    no
1      3     0
2      1     2

My code, df[['ID', 'outcome']].groupby('ID')['outcome'].nunique(), gives the number of unique values itself, like this:

ID
1   2
2   2

But I need the counts of yes and no. How can I achieve that? Thanks!

Answer 1

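Group on the ID column, then aggregate with value_counts on the outcome column. This produces a Series, so you need to convert it back to a data frame with .to_frame() so that you can unstack the yes/no values (i.e. turn them into columns). Passing fill_value=0 to unstack fills the missing combinations with zeros.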

df_total = df.groupby('ID')['outcome'].value_counts().to_frame().unstack(fill_value=0)
df_total.columns = df_total.columns.droplevel()
>>> df_total
outcome  no  yes
ID              
1         0    3
2         2    1
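As a self-contained check, the sample frame from the question can be rebuilt and the same grouping run end to end. This is a minimal sketch: it assumes that unstacking the Series directly (without the intermediate .to_frame() step) gives the same table, and the name `counts` is only illustrative.

```python
import pandas as pd

# Rebuild the sample frame from the question.
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 2],
    "outcome": ["yes", "yes", "yes", "no", "yes", "no"],
})

# value_counts on the grouped column returns a Series indexed by (ID, outcome);
# unstacking the inner level turns the outcome labels into columns.
counts = df.groupby("ID")["outcome"].value_counts().unstack(fill_value=0)

# Reorder the columns to match the expected output in the question.
counts = counts[["yes", "no"]]
print(counts)
```

Selecting `["yes", "no"]` at the end only changes the column order; it would raise a KeyError if one of the two labels never occurred in the data.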

Answer 2

How about pd.crosstab?

In [1217]: pd.crosstab(df.ID, df.outcome)
Out[1217]: 
outcome  no  yes
ID              
1         0    3
2         2    1
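One thing to keep in mind is that crosstab only creates columns for outcome values that actually occur in the data. The sketch below, which assumes the two expected labels are known in advance, reindexes the columns so that both yes and no are always present and appear in the order the question asks for. The name `table` is illustrative.

```python
import pandas as pd

# Rebuild the sample frame from the question.
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 2],
    "outcome": ["yes", "yes", "yes", "no", "yes", "no"],
})

# Cross-tabulate ID against outcome, then force a fixed column set and order.
# A label that never occurs would be added as a column of zeros.
table = pd.crosstab(df["ID"], df["outcome"]).reindex(
    columns=["yes", "no"], fill_value=0
)
print(table)
```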

Answer 3

Using set_index and pd.concat:

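df1 = df.set_index('ID')
# Series.sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent.
# Depending on the pandas version, the counts may print as integers rather than floats.
pd.concat([df1.outcome.eq('yes').groupby(level=0).sum(),
           df1.outcome.ne('yes').groupby(level=0).sum()], keys=['yes','no'],axis=1).reset_index()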

Output:

   ID  yes   no
0   1  3.0  0.0
1   2  1.0  2.0

Answer 4

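A very simple setup that is robust against bugs and takes advantage of fast vectorized functions is to build Boolean indicator columns and sum them per group: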

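df['dummy_yes'] = df.outcome == 'yes'
df['dummy_no'] = df.outcome == 'no'

# Sum only the indicator columns so the string column 'outcome' is left out of the aggregation.
df.groupby('ID')[['dummy_yes', 'dummy_no']].sum()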
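The same idea can be written without adding columns to the original frame. The following is a minimal sketch, assuming only the two labels yes and no matter; the names `indicators` and `result` are illustrative.

```python
import pandas as pd

# Rebuild the sample frame from the question.
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 2],
    "outcome": ["yes", "yes", "yes", "no", "yes", "no"],
})

# assign() returns a new frame with the Boolean indicator columns,
# leaving df itself unchanged.
indicators = df.assign(
    yes=df["outcome"].eq("yes"),
    no=df["outcome"].eq("no"),
)

# Summing Booleans counts the True values within each ID group.
result = indicators.groupby("ID")[["yes", "no"]].sum()
print(result)
```

Selecting the two indicator columns before calling sum() keeps the string column out of the aggregation.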

Answer 5

Option 2
pd.factorize + np.bincount
This is convoluted and painful... but very fast.

fi, ui = pd.factorize(df.ID.values)
fo, uo = pd.factorize(df.outcome.values)

n, m = ui.size, uo.size
pd.DataFrame(
    np.bincount(fi * m + fo, minlength=n * m).reshape(n, m),
    pd.Index(ui, name='ID'), pd.Index(uo, name='outcome')
)

outcome  yes  no
ID              
1          3   0
2          1   2
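To see why this works, it may help to look at the intermediate arrays. The sketch below walks through the same steps on the sample data: each (ID, outcome) pair is mapped to a single integer `fi * m + fo`, np.bincount counts how often each integer occurs, and the reshape turns the flat counts into an n-by-m table. The names `flat` and `table` are illustrative additions.

```python
import numpy as np
import pandas as pd

# Rebuild the sample frame from the question.
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 2],
    "outcome": ["yes", "yes", "yes", "no", "yes", "no"],
})

# factorize replaces each distinct value with an integer code, in order of
# first appearance, and also returns the distinct values themselves.
fi, ui = pd.factorize(df["ID"].to_numpy())        # codes for ID
fo, uo = pd.factorize(df["outcome"].to_numpy())   # codes for outcome

n, m = ui.size, uo.size

# Combine the two codes into one cell number of an n x m grid (row-major).
flat = fi * m + fo

# Count occurrences of every cell number, including cells that never occur,
# then fold the flat counts back into the grid.
table = np.bincount(flat, minlength=n * m).reshape(n, m)

print(flat)   # one cell number per row of df
print(table)  # rows follow ui, columns follow uo
```

Because factorize numbers the values in order of first appearance, the columns come out as yes, no here rather than in sorted order.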

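Option 3

# dtype=int keeps the indicators numeric, so the dot product yields counts rather than Booleans.
pd.get_dummies(df.ID, dtype=int).T.dot(pd.get_dummies(df.outcome, dtype=int))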

   no  yes
1   0    3
2   2    1

Option 4

df.groupby(['ID', 'outcome']).size().unstack(fill_value=0)
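For completeness, here is a sketch of how this last option could be shaped into exactly the flat layout the question asks for (columns ID, yes, no). The reindex, rename_axis and reset_index steps are additions for illustration and are not part of the original answer.

```python
import pandas as pd

# Rebuild the sample frame from the question.
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 2],
    "outcome": ["yes", "yes", "yes", "no", "yes", "no"],
})

table = (
    df.groupby(["ID", "outcome"])
      .size()                                         # rows per (ID, outcome) pair
      .unstack(fill_value=0)                          # outcome labels become columns
      .reindex(columns=["yes", "no"], fill_value=0)   # fixed column order
      .rename_axis(columns=None)                      # drop the "outcome" axis label
      .reset_index()                                  # turn ID back into a column
)
print(table)
```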
