Drastically improve the speed of subsetting and summarizing a pandas DataFrame

I have a DataFrame containing purchase history (the full DataFrame is given at the end of the post).

I have to generate a summary of each purchaser's purchases on the first day they came, the second day, the first week, the first month, and so on, like this:

purchaser firstDay secondDay firstWeek firstMonth 6months oneyear
0 anil 1 0 0 0 1
1 mukesh 1 0 1 7 0
2 ravi 8 0 0 4 1

What I did is as follows. First, I created the summary:

import datetime as dt

# first purchase date per purchaser, kept in the 'min' column
summary=df.groupby('purchaser').agg('min').rename(columns={'date':'min'}).reset_index()
# cutoff dates for each time bucket, relative to the first purchase
summary['oneday_date']=summary['min']+dt.timedelta(days=1)
summary['oneweek_date']=summary['min']+dt.timedelta(days=7)
summary['onemonth_date']=summary['min']+dt.timedelta(days=30)
summary['sixmonth_date']=summary['min']+dt.timedelta(days=183)
summary['year_date']=summary['min']+dt.timedelta(days=365)

Then I iterated over each purchaser and counted the purchases in each interval:

%%time
result=[]
for num, row in summary.iterrows():
    purchaser=row['purchaser']
    mindate=row['min']
    oneday=row['oneday_date']
    oneweek=row['oneweek_date']
    onemonth=row['onemonth_date']
    sixmonth=row['sixmonth_date']
    oneyear=row['year_date']
    
    subdf=df[df['purchaser']==purchaser]
    
    count0=len(subdf[(subdf['date']>=mindate) & (subdf['date']<oneday)])
    count1=len(subdf[(subdf['date']>=oneday) & (subdf['date']<oneweek)])
    count2=len(subdf[(subdf['date']>=oneweek) & (subdf['date']<onemonth)])
    count3=len(subdf[(subdf['date']>=onemonth) & (subdf['date']<sixmonth)])
    count4=len(subdf[(subdf['date']>=sixmonth) & (subdf['date']<oneyear)])
    count5=len(subdf[subdf['date']>=oneyear])
    
    result.append([purchaser,count0,count1,count2,count3,count4,count5])

CPU times: user 13.2 ms, sys: 587 µs, total: 13.8 ms
Wall time: 11.9 ms

My actual data is about a million times larger than this.

What I have already tried is

Neither of them brought any speed improvement.

Full data:

import pandas as pd
df=pd.DataFrame({'purchaser':['anil', 'anil', 'anil', 'anil', 'anil', 'anil', 'anil', 'anil', 'anil', 'anil', 'anil', 'anil', 'mukesh', 'mukesh', 'mukesh', 'mukesh', 'mukesh', 'mukesh', 'mukesh', 'mukesh', 'mukesh', 'ravi', 'ravi', 'ravi', 'ravi', 'ravi', 'ravi', 'ravi', 'ravi', 'ravi', 'ravi', 'ravi', 'ravi', 'ravi'],
'article':['pencil', 'pencil', 'pencil', 'pencil', 'rubber', 'rubber', 'rubber', 'rubber', 'sharpner', 'sharpner', 'sharpner', 'sharpner', 'pencil', 'pencil', 'rubber', 'sharpner', 'sharpner', 'sharpner', 'sharpner', 'sharpner', 'sharpner', 'pencil', 'pencil', 'pencil', 'pencil', 'pencil', 'pencil', 'sharpner', 'sharpner', 'sharpner', 'sharpner', 'sharpner', 'sharpner', 'rubber'],
'date':[1611316328000000000, 1612432758000000000, 1616319170000000000, 1622455063000000000, 1604242496000000000, 1604245635000000000, 1605421133000000000, 1570823168000000000, 1594919491000000000, 1604248351000000000, 1604237937000000000, 1604233396000000000, 1604251740000000000, 1601216201000000000, 1604232509000000000, 1604249925000000000, 1604246581000000000, 1603559931000000000, 1603946050000000000, 1603956529000000000, 1604228447000000000, 1604233557000000000, 1604212924000000000, 1604212924000000000, 1604212924000000000, 1612539904000000000, 1614939815000000000, 1614964750000000000, 1621581174000000000, 1604218928000000000, 1604222345000000000, 1604239015000000000, 1613635361000000000, 1604208994000000000]})
df['date']=pd.to_datetime(df['date'])

Algorithm:

  1. Use .groupby() to compute each purchaser's first purchase date, then derive the second-day, first-week, ... one-year cutoff dates from it. Save these in a second DataFrame.
  2. Left-join these dates onto the original dataset of all purchases, on the 'purchaser' column.
  3. Compute the desired columns from these date columns on the full purchase DataFrame df. This can now be done with vectorized operations instead of iterating over the whole array, which makes the runtime much faster.
  4. .groupby() on purchaser and sum the counts in each column to produce the final desired output (a sketch of these steps follows below).
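
A minimal sketch of these four steps, assuming the df defined under "Full data" above; the intermediate names (firsts, merged, first_purchase) are illustrative choices, and the bucket boundaries copy the ones used in the loop above:

import pandas as pd

# 1. First purchase date per purchaser.
firsts = df.groupby('purchaser')['date'].min().rename('first_purchase').reset_index()

# 2. Left-join the first purchase date back onto every purchase row.
merged = df.merge(firsts, on='purchaser', how='left')

# 3. Vectorized bucket indicators, one column per interval.
elapsed = merged['date'] - merged['first_purchase']
buckets = {
    'firstDay':   (pd.Timedelta(days=0),   pd.Timedelta(days=1)),
    'secondDay':  (pd.Timedelta(days=1),   pd.Timedelta(days=7)),
    'firstWeek':  (pd.Timedelta(days=7),   pd.Timedelta(days=30)),
    'firstMonth': (pd.Timedelta(days=30),  pd.Timedelta(days=183)),
    '6months':    (pd.Timedelta(days=183), pd.Timedelta(days=365)),
}
for col, (lo, hi) in buckets.items():
    merged[col] = ((elapsed >= lo) & (elapsed < hi)).astype(int)
merged['oneyear'] = (elapsed >= pd.Timedelta(days=365)).astype(int)

# 4. Sum the indicators per purchaser to get the per-bucket counts.
result = merged.groupby('purchaser')[list(buckets) + ['oneyear']].sum().reset_index()
print(result)

Because every row is classified with whole-column comparisons, there is no Python-level loop over purchasers, so the same code should scale to the much larger dataset.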