基于多个变量的迭代计数器

Iterative counter based on several variables

我正在尝试构建一个计数器来跟踪多个不同用户的失败和成功次数。我有一个数据框,其中包含重复的用户代码(如果有更多关于同一用户的事件)和一个时间戳来跟踪时间变量。我想添加两列(成功数、失败数)来累积前面事件的结果。

示例数据:

data=pd.DataFrame(
    {
        'user_id': [2,2,3,2,4,5,3,3,6,6,6,7],
        'timestamp': [1567641600,1567691600,1567741600,1567941600, 1567981600, 1567991600,1568391600,1568541600,1568741600,1568941600,1568981600,1568988600],
        'status': ['yes','no','yes','no', 'yes', 'yes','yes','no','no','yes','no','yes']
    }
)

我尝试在 R 中使用一些循环,但我担心我遗漏了一些东西,也许 Python 中有更好的方法来做到这一点?

想要的结果应该是这样的:

data=pd.DataFrame(
    {
        'user_id': [2,2,3,2,4,5,3,3,6,6,6,7],
        'timestamp': [1567641600,1567691600,1567741600,1567941600, 1567981600, 1567991600,1568391600,1568541600,1568741600,1568941600,1568981600,1568988600],
        'status': ['yes','no','yes','no', 'yes', 'yes','yes','no','no','yes','no','yes'],
        'number_yes':[1,1,1,1,1,1,2,2,0,1,1,1],
        'number_no':[0,1,0,2,0,0,0,1,1,1,2,0]
    }
)

使用,Series.eq to create a boolean mask, then use Series.groupby, on this mask and transform the grouped series using .cumsum:

m = data['status'].eq('yes')
data = data.assign(
    number_yes=m.groupby(data['user_id']).cumsum(),
    number_no=(~m).groupby(data['user_id']).cumsum()
)

# print(data)
    user_id   timestamp status  number_yes  number_no
0         2  1567641600    yes         1.0        0.0
1         2  1567691600     no         1.0        1.0
2         3  1567741600    yes         1.0        0.0
3         2  1567941600     no         1.0        2.0
4         4  1567981600    yes         1.0        0.0
5         5  1567991600    yes         1.0        0.0
6         3  1568391600    yes         2.0        0.0
7         3  1568541600     no         2.0        1.0
8         6  1568741600     no         0.0        1.0
9         6  1568941600    yes         1.0        1.0
10        6  1568981600     no         1.0        2.0
11        7  1568988600    yes         1.0        0.0
data['number_yes'] = data.groupby('user_id').status.transform(lambda x: (x == 'yes').cumsum())
data['number_no'] = data.groupby('user_id').status.transform(lambda x: (x == 'no').cumsum())

结果:

    user_id   timestamp status  number_yes  number_no
0         2  1567641600    yes           1          0
1         2  1567691600     no           1          1
2         3  1567741600    yes           1          0
3         2  1567941600     no           1          2
4         4  1567981600    yes           1          0
5         5  1567991600    yes           1          0
6         3  1568391600    yes           2          0
7         3  1568541600     no           2          1
8         6  1568741600     no           0          1
9         6  1568941600    yes           1          1
10        6  1568981600     no           1          2
11        7  1568988600    yes           1          0

让我们使用 get_dummies:

data.join(data['status'].str.get_dummies()
                        .groupby(data['user_id']).cumsum()
                        .add_prefix('Number_'))

输出:

    user_id   timestamp status  Number_no  Number_yes
0         2  1567641600    yes          0           1
1         2  1567691600     no          1           1
2         3  1567741600    yes          0           1
3         2  1567941600     no          2           1
4         4  1567981600    yes          0           1
5         5  1567991600    yes          0           1
6         3  1568391600    yes          0           2
7         3  1568541600     no          1           2
8         6  1568741600     no          1           0
9         6  1568941600    yes          1           1
10        6  1568981600     no          2           1
11        7  1568988600    yes          0           1

我喜欢使用 str.get_dummies 的地方在于它不仅可以处理 'yes' 和 'no',让我们插入一个新状态 'maybe':

data=pd.DataFrame(
    {
        'user_id': [2,2,3,2,4,5,3,3,6,6,6,7],
        'timestamp': [1567641600,1567691600,1567741600,1567941600, 1567981600, 1567991600,1568391600,1568541600,1568741600,1568941600,1568981600,1568988600],
        'status': ['yes','no','yes','no', 'maybe', 'yes','yes','no','maybe','yes','no','yes']
    })

data.join(data['status'].str.get_dummies()
                        .groupby(data['user_id']).cumsum()
                        .add_prefix('Number_'))

输出:

    user_id   timestamp status  Number_maybe  Number_no  Number_yes
0         2  1567641600    yes             0          0           1
1         2  1567691600     no             0          1           1
2         3  1567741600    yes             0          0           1
3         2  1567941600     no             0          2           1
4         4  1567981600  maybe             1          0           0
5         5  1567991600    yes             0          0           1
6         3  1568391600    yes             0          0           2
7         3  1568541600     no             0          1           2
8         6  1568741600  maybe             1          0           0
9         6  1568941600    yes             1          0           1
10        6  1568981600     no             1          1           1
11        7  1568988600    yes             0          0           1