Enumerating a grouped variable in Python

Question

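I have a pandas DataFrame in Python with patient ID numbers, where each record represents a different appointment. At each appointment, a feature (dx) is recorded as 0 or 1. I want to create a new feature that sums up the dx feature, but only up to that point for that patient.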

patient_ID   |   dx
 29847            0
 29847            1
 29847            0
 29847            1
 29847            1

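I can get the sum for the group with a simple groupby statement:

df.groupby(['patient_ID'])['dx'].sum()

But what I want is the enumerated value as a new feature, taking into account only the current and previous records: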

patient_ID   |   dx   |   dx_enum
 29847            0         0
 29847            1         1
 29847            0         1
 29847            1         2
 29847            1         3

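I imagine this will involve some combination of a for loop and a groupby statement, but so far I haven't had any success. Thanks for any help you can provide!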

Answer 1

If I understand what you want, you can add the column by performing a groupby, then calling transform and passing the function cumsum:

In [44]:

df['dx_enum'] = df.groupby('patient_ID')['dx'].transform(pd.Series.cumsum)
df
Out[44]:
   patient_ID  dx  dx_enum
0       29847   0        0
1       29847   1        1
2       29847   0        1
3       29847   1        2
4       29847   1        3

transform returns a Series aligned to the original df, so you can add it as a column. See the docs: http://pandas.pydata.org/pandas-docs/stable/groupby.html#transformation
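For anyone who wants to check the behavior with more than one patient, here is a minimal, self-contained sketch. The second patient ID (30112) and the interleaved row order are made up for illustration, and the sketch assumes the rows are already in chronological order within each patient. It also shows that calling cumsum directly on the grouped column should give the same result as the transform form above.

```python
import pandas as pd

# Hypothetical sample data: patient 30112 is invented for illustration,
# and rows are assumed to already be in appointment order.
df = pd.DataFrame({
    "patient_ID": [29847, 29847, 30112, 29847, 30112, 29847, 29847],
    "dx":         [0, 1, 1, 0, 1, 1, 1],
})

# Running total of dx within each patient, following the existing row order.
df["dx_enum"] = df.groupby("patient_ID")["dx"].cumsum()

# Same computation written with transform, as in the answer above.
df["dx_enum_transform"] = df.groupby("patient_ID")["dx"].transform("cumsum")

print(df)
```

The running total restarts for each patient_ID, so the rows for 29847 still end at 3 even though another patient's appointments are mixed in. If the rows are not in chronological order, they would need to be sorted first, since cumsum simply follows the order of the rows.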

python

grouping

pandas