Drastically improve speed to subset and summarize pandas dataframe
I have a dataframe containing purchase history (the full dataframe is at the end of the post).
I have to generate a summary of each purchaser's purchases on the first day they came, the second day, the first week, the first month, and so on, like this:
 | purchaser | firstDay | secondDay | firstWeek | firstMonth | 6months | oneyear |
---|---|---|---|---|---|---|---|
0 | anil | 1 | 0 | 0 | 0 | 1 | |
1 | mukesh | 1 | 0 | 1 | 7 | 0 | |
2 | ravi | 8 | 0 | 0 | 4 | 1 | |
What I did is as follows:
Created the summary:
import datetime as dt
import pandas as pd

# earliest purchase date per purchaser, then the cutoff dates for each window
summary=df.groupby('purchaser').agg('min').rename(columns={'date':'min'}).reset_index()
summary['oneday_date']=summary['min']+dt.timedelta(days=1)
summary['oneweek_date']=summary['min']+dt.timedelta(days=7)
summary['onemonth_date']=summary['min']+dt.timedelta(days=30)
summary['sixmonth_date']=summary['min']+dt.timedelta(days=183)
summary['year_date']=summary['min']+dt.timedelta(days=365)
Then iterated over each purchaser and counted:
%%time
result=[]
for num, row in summary.iterrows():
    purchaser=row['purchaser']
    mindate=row['min']
    oneday=row['oneday_date']
    oneweek=row['oneweek_date']
    onemonth=row['onemonth_date']
    sixmonth=row['sixmonth_date']
    oneyear=row['year_date']
    subdf=df[df['purchaser']==purchaser]
    count0=len(subdf[(subdf['date']>=mindate) & (subdf['date']<oneday)])
    count1=len(subdf[(subdf['date']>=oneday) & (subdf['date']<oneweek)])
    count2=len(subdf[(subdf['date']>=oneweek) & (subdf['date']<onemonth)])
    count3=len(subdf[(subdf['date']>=onemonth) & (subdf['date']<sixmonth)])
    count4=len(subdf[(subdf['date']>=sixmonth) & (subdf['date']<oneyear)])
    count5=len(subdf[subdf['date']>=oneyear])
    result.append([purchaser,count0,count1,count2,count3,count4,count5])
CPU times: user 13.2 ms, sys: 587 µs, total: 13.8 ms
Wall time: 11.9 ms
My actual data is about a million times bigger than this.
What I have already tried:
- Indexing the dataframe on date with df=df.set_index('date')
- Sorting subdf on dates

Neither gave any speedup.
Full data:
df=pd.DataFrame({'purchaser':['anil', 'anil', 'anil', 'anil', 'anil', 'anil', 'anil', 'anil', 'anil', 'anil', 'anil', 'anil', 'mukesh', 'mukesh', 'mukesh', 'mukesh', 'mukesh', 'mukesh', 'mukesh', 'mukesh', 'mukesh', 'ravi', 'ravi', 'ravi', 'ravi', 'ravi', 'ravi', 'ravi', 'ravi', 'ravi', 'ravi', 'ravi', 'ravi', 'ravi'],
'article':['pencil', 'pencil', 'pencil', 'pencil', 'rubber', 'rubber', 'rubber', 'rubber', 'sharpner', 'sharpner', 'sharpner', 'sharpner', 'pencil', 'pencil', 'rubber', 'sharpner', 'sharpner', 'sharpner', 'sharpner', 'sharpner', 'sharpner', 'pencil', 'pencil', 'pencil', 'pencil', 'pencil', 'pencil', 'sharpner', 'sharpner', 'sharpner', 'sharpner', 'sharpner', 'sharpner', 'rubber'],
'date':[1611316328000000000, 1612432758000000000, 1616319170000000000, 1622455063000000000, 1604242496000000000, 1604245635000000000, 1605421133000000000, 1570823168000000000, 1594919491000000000, 1604248351000000000, 1604237937000000000, 1604233396000000000, 1604251740000000000, 1601216201000000000, 1604232509000000000, 1604249925000000000, 1604246581000000000, 1603559931000000000, 1603946050000000000, 1603956529000000000, 1604228447000000000, 1604233557000000000, 1604212924000000000, 1604212924000000000, 1604212924000000000, 1612539904000000000, 1614939815000000000, 1614964750000000000, 1621581174000000000, 1604218928000000000, 1604222345000000000, 1604239015000000000, 1613635361000000000, 1604208994000000000]})
# the integer timestamps are nanoseconds since the epoch; to_datetime converts them to datetime64
df['date']=pd.to_datetime(df['date'])
Algorithm:
- Use .groupby() to compute each purchaser's first purchase date, then derive the second-day, first-week, ... one-year cutoff dates from it. Save these in a second dataframe.
- Left-join these dates onto the original dataset of all purchases on the purchaser column.
- Compute the desired columns from these date columns on the full purchase dataframe. This can now be done with vectorized operations instead of iterating over the whole array, which makes the runtime much faster.
- .groupby() on purchaser and sum the counts in each column to produce the final desired output (see the sketch below).
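A minimal sketch of this approach, assuming the cutoff column names (min, oneday_date, ...) from the question's summary dataframe and the bucket names from the desired output:

import pandas as pd

# Step 1: first purchase date per purchaser, plus the cutoff dates for each window.
summary = (df.groupby('purchaser', as_index=False)['date'].min()
             .rename(columns={'date': 'min'}))
for name, days in [('oneday_date', 1), ('oneweek_date', 7), ('onemonth_date', 30),
                   ('sixmonth_date', 183), ('year_date', 365)]:
    summary[name] = summary['min'] + pd.Timedelta(days=days)

# Step 2: left-join the cutoff dates back onto every purchase row.
merged = df.merge(summary, on='purchaser', how='left')

# Step 3: build one boolean column per bucket with vectorized comparisons.
merged['firstDay']   = (merged['date'] >= merged['min'])           & (merged['date'] < merged['oneday_date'])
merged['secondDay']  = (merged['date'] >= merged['oneday_date'])   & (merged['date'] < merged['oneweek_date'])
merged['firstWeek']  = (merged['date'] >= merged['oneweek_date'])  & (merged['date'] < merged['onemonth_date'])
merged['firstMonth'] = (merged['date'] >= merged['onemonth_date']) & (merged['date'] < merged['sixmonth_date'])
merged['6months']    = (merged['date'] >= merged['sixmonth_date']) & (merged['date'] < merged['year_date'])
merged['oneyear']    =  merged['date'] >= merged['year_date']

# Step 4: sum the boolean flags per purchaser to get the counts.
cols = ['firstDay', 'secondDay', 'firstWeek', 'firstMonth', '6months', 'oneyear']
result = merged.groupby('purchaser', as_index=False)[cols].sum()

Because every comparison runs over whole columns at once, the per-purchaser Python loop disappears; only the merge and one groupby remain, which should scale much better to millions of rows.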