Why is groupby so fast?
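This is a follow-up question to an earlier one, in which jezrael used pandas.DataFrame.groupby to speed up the creation of a list by a factor of several hundred. Specifically, let df be a large dataframe; then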
index = list(set(df.index))
list_df = [df.loc[x] for x in index]
and
list_df = [x for i,x in df.groupby(level=0, sort=False)]
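produce the same result, with the latter being more than 200 times faster than the former, even ignoring the list-creation step. Why?
I would be very glad if someone could help me understand why there is such a massive performance difference. Thanks in advance!
Edit: As Alex Riley suggested in his comment, I confirm that the tests were run on a dataframe with a non-unique and non-monotonic index.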
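The two index properties mentioned in the edit can be checked directly. Below is a minimal sketch on a small, made-up frame; the name `small` and its values are illustrative assumptions, not the data used in the original test.

```python
import numpy as np
import pandas as pd

# Hypothetical small frame whose index labels repeat and are out of order.
small = pd.DataFrame({"x": np.arange(6.0)}, index=[2, 0, 1, 2, 0, 1])

# Both properties are False for this index.
print(small.index.is_unique)
print(small.index.is_monotonic_increasing)

# With a non-unique index, .loc[label] returns every row carrying that label.
print(small.loc[2])
```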
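Because your data frame is not sorted on the index, all the subsetting has to be done with a slow vector scan, and fast algorithms such as binary search cannot be applied. groupby, by contrast, always sorts the data frame by the group variable first. You can mimic this behavior with a simple algorithm that sorts the index and then subsets, to validate this: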
def sort_subset(df):
    # sort index and find out the positions that separate groups
    df = df.sort_index()
    split_indices = np.flatnonzero(np.ediff1d(df.index, to_begin=1, to_end=1))
    list_df = []
    for i in range(len(split_indices)-1):
        start_index = split_indices[i]
        end_index = split_indices[i+1]
        list_df.append(df.iloc[start_index:end_index])
    return list_df
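To make the binary-search point concrete, here is a small sketch of how a single label's rows can be located on a sorted index with two binary searches instead of a full scan. The helper `rows_for_label` and the frame `demo` are hypothetical names introduced for illustration, and this is not necessarily how pandas implements the lookup internally.

```python
import numpy as np
import pandas as pd

def rows_for_label(sorted_df, label):
    """Return the rows for `label`, assuming sorted_df has a sorted index."""
    values = sorted_df.index.to_numpy()
    # Binary search for the first position of the label and one past its last.
    start = np.searchsorted(values, label, side="left")
    stop = np.searchsorted(values, label, side="right")
    return sorted_df.iloc[start:stop]

# Hypothetical demo frame, sorted by index before the lookup.
demo = pd.DataFrame({"x": [10, 20, 30, 40, 50]}, index=[3, 1, 2, 1, 3]).sort_index()
print(rows_for_label(demo, 1))
```

Each lookup costs two O(log n) searches plus a slice, whereas an unsorted index has to be compared against the label element by element.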
Some timings:
import pandas as pd
import numpy as np
nrow = 1000000
df = pd.DataFrame(np.random.randn(nrow), columns=['x'], index=np.random.randint(100, size=nrow))
index = list(set(df.index))
print('no of groups: ', len(index))
%timeit list_df_1 = [df.loc[x] for x in index]
#no of groups: 100
#13.6 s ± 228 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit list_df_2 = [x for i, x in df.groupby(level=0, sort=False)]
#54.8 ms ± 1.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Not as fast because my algorithm is not optimized at all but the same order of magnitude
%timeit list_df_3 = sort_subset(df)
#102 ms ± 3.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
list_df_1 = [df.loc[x] for x in index]
list_df_2 = [x for i, x in df.groupby(level=0, sort=False)]
list_df_3 = sort_subset(df)
Compare the results:
all(list_df_3[i].eq(list_df_2[i]).all().iat[0] for i in range(len(list_df_2)))
# True
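The check above compares the two lists position by position, so it relies on both producing the groups in the same order. A sketch of an order-independent comparison, keyed by group label, might look like the following; `df_small`, `by_loc` and `by_groupby` are illustrative names, and the frame is deliberately much smaller than the one timed above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000
# Small frame with a non-unique, non-monotonic integer index.
df_small = pd.DataFrame({"x": rng.standard_normal(n)}, index=rng.integers(10, size=n))

# .loc[[label]] always returns a DataFrame, even if a label occurs only once.
by_loc = {label: df_small.loc[[label]] for label in set(df_small.index)}
by_groupby = {label: group for label, group in df_small.groupby(level=0, sort=False)}

# Same set of labels, and identical rows for every label.
assert by_loc.keys() == by_groupby.keys()
for label in by_loc:
    pd.testing.assert_frame_equal(by_loc[label], by_groupby[label])
```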
Sorting the index before subsetting also gives a significant speed-up:
def sort_subset_with_loc(df):
    # relies on the global `index` list defined in the timing setup
    df = df.sort_index()
    list_df_1 = [df.loc[x] for x in index]
    return list_df_1
%timeit sort_subset_with_loc(df)
# 25.4 ms ± 897 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
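The `%timeit` lines above are IPython magics. Outside IPython, a rough equivalent can be sketched with the standard-library `timeit` module. The names `df_t`, `labels`, `t_loc`, `t_groupby` and `t_sorted_loc` are illustrative, the frame is deliberately smaller than the one used above, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

nrow = 20_000  # much smaller than the 1,000,000 rows timed above
df_t = pd.DataFrame(np.random.randn(nrow), columns=["x"],
                    index=np.random.randint(100, size=nrow))
labels = list(set(df_t.index))

# Time each approach a few times and report the total in seconds.
t_loc = timeit.timeit(lambda: [df_t.loc[x] for x in labels], number=3)
t_groupby = timeit.timeit(
    lambda: [g for _, g in df_t.groupby(level=0, sort=False)], number=3)

def sorted_loc():
    # Sort once, then subset by label on the sorted index.
    s = df_t.sort_index()
    return [s.loc[x] for x in labels]

t_sorted_loc = timeit.timeit(sorted_loc, number=3)

print(f"loc on unsorted index: {t_loc:.4f} s")
print(f"groupby:               {t_groupby:.4f} s")
print(f"sort_index then loc:   {t_sorted_loc:.4f} s")
```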