Pandas: count events per day from join date

I have this DataFrame:

name    event     join_date    created_at    
A       X         2020-12-01   2020-12-01
A       X         2020-12-01   2020-12-01
A       X         2020-12-01   2020-12-02
A       Y         2020-12-01   2020-12-02
B       X         2020-12-05   2020-12-05
B       X         2020-12-05   2020-12-07
C       X         2020-12-07   2020-12-08
C       X         2020-12-07   2020-12-09
...

I want to transform it into this DataFrame:

name   event    join_date    day_0   day_1    day_2 .... day_n
A      X        2020-12-01   2       1        0          0
A      Y        2020-12-01   0       1        0          0
B      X        2020-12-05   1       0        1          0
C      X        2020-12-07   0       1        1          0
...

The first row means that user A did event X twice on day_0 (the day they joined), once on day_1, and so on up to day_n.

Currently, the result looks like this:

name   event    join_date    day_0   day_1    day_2 .... day_n
A      X        2020-12-01   2       1        0          0
A      Y        2020-12-01   0       1        0          0
B      X        2020-12-05   1       0        1          0
C      X        2020-12-07   1       1        0          0
...

The code treats 2020-12-08 as day_0 instead of day_1, because user C has no event on 2020-12-07, their join date.

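First, subtract join_date from created_at row by row, so that each event gets its offset from the user's join date as a timedelta. No GroupBy.transform is needed, because join_date is already present on every row.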

Then use DataFrame.pivot_table to count the events, add every possible offset as a column with DataFrame.reindex and timedelta_range, and finally rename the columns using range:

import pandas as pd

# Offset of each event from the join date (both columns must be datetime64)
df['d'] = df['created_at'].sub(df['join_date'])
print(df)
  name event  join_date created_at      d
0    A     X 2020-12-01 2020-12-01 0 days
1    A     X 2020-12-01 2020-12-01 0 days
2    A     X 2020-12-01 2020-12-02 1 days
3    A     Y 2020-12-01 2020-12-02 1 days
4    B     X 2020-12-05 2020-12-05 0 days
5    B     X 2020-12-05 2020-12-07 2 days
6    C     X 2020-12-07 2020-12-08 1 days
7    C     X 2020-12-07 2020-12-09 2 days
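
The subtraction above only works when both columns have a datetime dtype. The question does not show the dtypes, so if the dates were read as strings they would need to be parsed first. A minimal sketch of that step, using a small hypothetical stand-in frame named `sample` rather than the question's `df`:

```python
import pandas as pd

# Hypothetical stand-in for the question's data, with dates stored as strings
sample = pd.DataFrame({
    'join_date':  ['2020-12-01', '2020-12-05', '2020-12-07'],
    'created_at': ['2020-12-02', '2020-12-07', '2020-12-08'],
})

# Parse both columns so that subtracting them yields timedeltas
for col in ['join_date', 'created_at']:
    sample[col] = pd.to_datetime(sample[col])

# Timedelta offset, plus the same offset as a plain integer number of days
sample['d'] = sample['created_at'].sub(sample['join_date'])
sample['day'] = sample['d'].dt.days
print(sample[['d', 'day']])
```

The integer `day` column is optional; it can be handy when the day number is needed directly, for example to build column names.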

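# Count events per (name, event, join_date) for each offset
df1 = (df.pivot_table(index=['name', 'event', 'join_date'],
                      columns='d',
                      aggfunc='size',
                      fill_value=0)
         # Add a zero-filled column for every offset with no events
         .reindex(pd.timedelta_range(df['d'].min(), df['d'].max()),
                  axis=1,
                  fill_value=0))
# Rename the timedelta columns to day_0, day_1, ...
# (positional, so this assumes the smallest offset is 0 days)
df1.columns = [f'day_{i}' for i in range(len(df1.columns))]
df1 = df1.reset_index()
print(df1)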
  name event  join_date  day_0  day_1  day_2
0    A     X 2020-12-01      2      1      0
1    A     Y 2020-12-01      0      1      0
2    B     X 2020-12-05      1      0      1
3    C     X 2020-12-07      0      1      1
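
For reference, the whole transformation can also be sketched as one self-contained function. This is a variation, not the code above: it swaps pivot_table for groupby/size/unstack and uses integer day offsets, so each column name comes from the offset itself rather than from its position. The function name `count_events_per_day` is hypothetical, and the frame below just rebuilds the sample rows from the question.

```python
import pandas as pd

def count_events_per_day(df):
    """Count events per (name, event, join_date) for each day since joining."""
    out = df.copy()
    # Integer number of days between the event and the join date
    out['day'] = (out['created_at'] - out['join_date']).dt.days
    last_day = int(out['day'].max())
    counts = (out.groupby(['name', 'event', 'join_date', 'day'])
                 .size()
                 .unstack('day', fill_value=0)
                 # Keep exactly days 0..last_day; events dated before
                 # join_date (negative offsets) are dropped here
                 .reindex(columns=range(last_day + 1), fill_value=0))
    counts.columns = [f'day_{i}' for i in counts.columns]
    return counts.reset_index()

# Sample rows from the question
df = pd.DataFrame({
    'name':  list('AAAABBCC'),
    'event': list('XXXYXXXX'),
    'join_date': pd.to_datetime(['2020-12-01'] * 4
                                + ['2020-12-05'] * 2
                                + ['2020-12-07'] * 2),
    'created_at': pd.to_datetime(['2020-12-01', '2020-12-01',
                                  '2020-12-02', '2020-12-02',
                                  '2020-12-05', '2020-12-07',
                                  '2020-12-08', '2020-12-09']),
})

result = count_events_per_day(df)
print(result)
```

Because the columns are reindexed from 0, day_0 always corresponds to the join date, even if no user has an event on that day.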