A way to make the [apply + lambda + loc] combination efficient
A dataframe loaded from a .csv into pandas (containing 2016 election data) has the following structure:
In [2]: df
Out[2]:
county candidate votes ...
0 Ada Trump 10000 ...
1 Ada Clinton 900 ...
2 Adams Trump 12345 ...
.
.
n Total ... ... ...
The idea is to compute the top N counties with the highest percentage of votes for candidate X (dropping the Total rows).
For example, say we want 100 counties and the candidate is Trump; the operation to perform is: 100 * sum of votes for Trump / total votes.
I have implemented the following code, which gives the correct result:
In [3]: (df.groupby(by="county")
           .apply(lambda x: 100 * x.loc[(x.candidate == "Trump")
                                        & (x.county != "Total"), "votes"].sum() / x.votes.sum())
           .nlargest(100)
           .reset_index(name='percentage'))
Out[3]:
county percentage
0 Hayes 91.82
1 WALLACE 90.35
2 Arthur 89.37
.
.
99 GRANT 79.10
Timing it with %%time, I found that it is slow:
Out[3]:
CPU times: user 964 ms, sys: 24 ms, total: 988 ms
Wall time: 943 ms
Is there a way to make it faster?
You could try:
- Assuming you do not have a 'Total' row holding the sum of all votes:
(df[df['candidate'] == 'Trump'].groupby('county')[['votes']].sum() / df['votes'].sum() * 100).nlargest(100, 'votes')
- Assuming you do have a 'Total' row holding the sum of all votes:
(df[df['candidate'] == 'Trump'].groupby('county')[['votes']].sum() / df.loc[df['candidate'] != 'Total', 'votes'].sum() * 100).nlargest(100, 'votes')
I cannot test it since I do not have the data, but it does not use any apply, so it should improve performance.
To rename the column, you can append .rename(columns={'votes':'percentage'}) at the end.
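Putting those pieces together, here is a minimal runnable sketch of this approach on hypothetical toy data (the county names and vote counts below are invented for illustration, not taken from the real 2016 CSV):

```python
import pandas as pd

# Toy data standing in for the election CSV (hypothetical numbers).
df = pd.DataFrame({
    "county": ["Ada", "Ada", "Adams", "Adams", "Total"],
    "candidate": ["Trump", "Clinton", "Trump", "Clinton", "Total"],
    "votes": [10_000, 900, 12_345, 5_000, 28_245],
})

# The pattern above: filter to the candidate first, group once, and divide
# by the grand total (excluding the 'Total' row) -- no apply involved.
grand_total = df.loc[df["candidate"] != "Total", "votes"].sum()
result = (
    df[df["candidate"] == "Trump"]
    .groupby("county")[["votes"]]
    .sum()
    .div(grand_total)
    .mul(100)
    .nlargest(100, "votes")
    .rename(columns={"votes": "percentage"})
)
print(result)
```

Note that this divides by the overall vote total, as in the answer above; the percentages it produces are shares of the grand total, not per-county shares.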
You can try modifying your code to use only vectorized operations to speed up the process, like this:
df1 = df.loc[(df.county != "Total")] # exclude the Total row(s)
df2 = 100 * df1.groupby(['county', 'candidate'])['votes'].sum() / df1.groupby('county')['votes'].sum() # calculate percentage for each candidate
df3 = df2.nlargest(100).reset_index(name='percentage') # get the largest 100
df3.loc[df3.candidate == "Trump"] # Finally, filter by candidate
Edit:
If you want the top 100 counties with the highest percentage for the candidate, change the code slightly:
df1 = df.loc[(df.county != "Total")] # exclude the Total row(s)
df2 = 100 * df1.groupby(['county', 'candidate'])['votes'].sum() / df1.groupby('county')['votes'].sum() # calculate percentage for each candidate
df3a = df2.reset_index(name='percentage') # get the percentage
df3a.loc[df3a.candidate == "Trump"].nlargest(100, 'percentage') # Finally, filter by candidate and get the top 100 counties with highest percentages for the candidate
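As a self-contained sketch, the edited version can be run end-to-end on hypothetical toy data (the names and numbers below are invented for illustration):

```python
import pandas as pd

# Toy data with the same columns as the question (hypothetical numbers).
df = pd.DataFrame({
    "county": ["Ada", "Ada", "Adams", "Adams", "Total"],
    "candidate": ["Trump", "Clinton", "Trump", "Clinton", "Total"],
    "votes": [10_000, 900, 12_345, 5_000, 28_245],
})

# Two vectorized groupbys: per-(county, candidate) votes over per-county votes.
df1 = df.loc[df.county != "Total"]
per_pair = df1.groupby(["county", "candidate"])["votes"].sum()
per_county = df1.groupby("county")["votes"].sum()
pct = per_pair.div(per_county, level="county").mul(100).reset_index(name="percentage")

# Filter to the candidate, then take the top counties by percentage.
top = pct.loc[pct.candidate == "Trump"].nlargest(100, "percentage")
print(top)
```

The explicit `level="county"` broadcasts the per-county totals across the (county, candidate) MultiIndex, which is the same alignment the plain `/` in the answer relies on.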