Pandas count event per day from join date
I have this dataframe:
name event join_date created_at
A X 2020-12-01 2020-12-01
A X 2020-12-01 2020-12-01
A X 2020-12-01 2020-12-02
A Y 2020-12-01 2020-12-02
B X 2020-12-05 2020-12-05
B X 2020-12-05 2020-12-07
C X 2020-12-07 2020-12-08
C X 2020-12-07 2020-12-09
...
I want to convert it into this dataframe:
name event join_date day_0 day_1 day_2 .... day_n
A X 2020-12-01 2 1 0 0
A Y 2020-12-01 0 1 0 0
B X 2020-12-05 1 0 1 0
C X 2020-12-07 0 1 1 0
...
The first row means that user A did event X twice on day_0 (the day he joined), once on day_1, and so on until day_n.
Right now, the result looks like this:
name event join_date day_0 day_1 day_2 .... day_n
A X 2020-12-01 2 1 0 0
A Y 2020-12-01 0 1 0 0
B X 2020-12-05 1 0 1 0
C X 2020-12-07 1 1 0 0
...
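The code treats a user's first event date as day_0 instead of counting from join_date. For example, user C has no event on the join date 2020-12-07, so 2020-12-08 is set as day_0 instead of day_1.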
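First, subtract join_date from created_at so that every event gets its offset from the join date as a timedelta (both columns must already be datetimes). Then use DataFrame.pivot_table to count the rows per offset, add all possible offsets with DataFrame.reindex by timedelta_range, and finally convert the column names with range: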
df['d'] = df['created_at'].sub(df['join_date'])
print (df)
name event join_date created_at d
0 A X 2020-12-01 2020-12-01 0 days
1 A X 2020-12-01 2020-12-01 0 days
2 A X 2020-12-01 2020-12-02 1 days
3 A Y 2020-12-01 2020-12-02 1 days
4 B X 2020-12-05 2020-12-05 0 days
5 B X 2020-12-05 2020-12-07 2 days
6 C X 2020-12-07 2020-12-08 1 days
7 C X 2020-12-07 2020-12-09 2 days
df1 = (df.pivot_table(index=['name','event','join_date'],
                      columns='d',
                      aggfunc='size',
                      fill_value=0)
         # start at 0 days so that day_0 is always the join date,
         # even if nobody has an event on the day they joined
         .reindex(pd.timedelta_range(pd.Timedelta(0), df['d'].max()),
                  axis=1,
                  fill_value=0))
df1.columns = [f'day_{i}' for i in range(len(df1.columns))]
df1 = df1.reset_index()
print (df1)
name event join_date day_0 day_1 day_2
0 A X 2020-12-01 2 1 0
1 A Y 2020-12-01 0 1 0
2 B X 2020-12-05 1 0 1
3 C X 2020-12-07 0 1 1
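For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea. It is not the code above: it swaps pivot_table for groupby + unstack and works with integer day offsets instead of timedeltas. The sample frame is rebuilt from the question, and the names `day` and `counts` are only illustrative.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'name': ['A', 'A', 'A', 'A', 'B', 'B', 'C', 'C'],
    'event': ['X', 'X', 'X', 'Y', 'X', 'X', 'X', 'X'],
    'join_date': ['2020-12-01'] * 4 + ['2020-12-05'] * 2 + ['2020-12-07'] * 2,
    'created_at': ['2020-12-01', '2020-12-01', '2020-12-02', '2020-12-02',
                   '2020-12-05', '2020-12-07', '2020-12-08', '2020-12-09'],
})

# The subtraction only works on datetime columns, so convert explicitly
df['join_date'] = pd.to_datetime(df['join_date'])
df['created_at'] = pd.to_datetime(df['created_at'])

# Whole days elapsed since the join date (0 = the day the user joined)
df['day'] = (df['created_at'] - df['join_date']).dt.days

n_days = int(df['day'].max()) + 1

counts = (df.groupby(['name', 'event', 'join_date', 'day'])
            .size()                                   # events per user/event/day
            .unstack('day', fill_value=0)             # one column per observed day
            .reindex(columns=range(n_days), fill_value=0)  # add missing days from 0
            .add_prefix('day_')
            .reset_index())

print(counts)
```

Because the columns are reindexed from 0 rather than from the smallest observed offset, day_0 stays tied to the join date even when no user has an event on that day. Events dated before the join date (negative offsets) would be dropped by the reindex in this sketch.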