A way to make the [apply + lambda + loc] efficient

The data frame loaded from a .csv into pandas (containing 2016 election data) has the following structure:

In [2]: df
Out[2]: 
   county  candidate  votes  ...
0  Ada     Trump      10000  ...
1  Ada     Clinton    900    ... 
2  Adams   Trump      12345  ...
.
.
n  Total   ...        ...    ...

The idea is to find the top X counties with the highest percentage of votes for candidate X (dropping the Total row).

For example, suppose we want 100 counties and the candidate is Trump; the operation to perform is: 100 * (votes for Trump in the county) / (total votes in the county).

I have implemented the following code, which gives the correct result:

In [3]: (df.groupby(by="county")
           .apply(lambda x: 100 * x.loc[(x.candidate == "Trump") 
                  & (x.county != "Total"), "votes"].sum() / x.votes.sum())
           .nlargest(100) 
           .reset_index(name='percentage'))
Out[3]: 
   county   percentage
0  Hayes    91.82
1  WALLACE  90.35
2  Arthur   89.37
.
.
99 GRANT    79.10

Using %%time, I found that it is slow:

Out[3]: 
CPU times: user 964 ms, sys: 24 ms, total: 988 ms
Wall time: 943 ms

Is there a way to make it faster?

You could try:

  • Assuming you don't have a 'Total' row containing the sum of all votes:
(100 * df.loc[df['candidate'] == 'Trump'].groupby('county')['votes'].sum() / df.groupby('county')['votes'].sum()).nlargest(100)
  • Assuming you have a 'Total' row containing the sum of all votes, drop it first:
df1 = df.loc[df['county'] != 'Total']
(100 * df1.loc[df1['candidate'] == 'Trump'].groupby('county')['votes'].sum() / df1.groupby('county')['votes'].sum()).nlargest(100)

I can't test it because I don't have the data, but it avoids apply entirely, which should improve performance. (Note that the denominator has to be the per-county vote total, not the grand total, to match your percentages.)

To get a named column, you can append .reset_index(name='percentage') at the end.
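Since the answer above is untested, here is a quick check on made-up data (only the column names and the candidate filter come from the question; the numbers are invented) that an apply-free filter + groupby computation reproduces the original per-county percentages:

```python
import pandas as pd

# Made-up numbers with the question's column layout.
df = pd.DataFrame({
    "county":    ["Ada", "Ada", "Adams", "Adams"],
    "candidate": ["Trump", "Clinton", "Trump", "Clinton"],
    "votes":     [10000, 900, 12345, 5000],
})

# Vectorized: Trump votes per county divided by all votes per county.
vec = (100 * df.loc[df.candidate == "Trump"].groupby("county")["votes"].sum()
           / df.groupby("county")["votes"].sum()).nlargest(100)

# The question's apply-based version, for comparison.
ref = (df.groupby("county")
         .apply(lambda x: 100 * x.loc[x.candidate == "Trump", "votes"].sum()
                / x.votes.sum())
         .nlargest(100))

print((vec - ref).abs().max() < 1e-9)  # True: both give the same percentages
```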

You can try modifying your code to use only vectorized operations to speed up the process, as follows:

df1 = df.loc[(df.county != "Total")]     # exclude the Total row(s)

df2 = 100 * df1.groupby(['county', 'candidate'])['votes'].sum() / df1.groupby('county')['votes'].sum()     # calculate percentage for each candidate

df3 = df2.nlargest(100).reset_index(name='percentage')   # get the largest 100

df3.loc[df3.candidate == "Trump"]        # Finally, filter by candidate

Edit:

If you want the top 100 counties with the highest percentages, you can slightly change the code, as below:

df1 = df.loc[(df.county != "Total")]     # exclude the Total row(s)

df2 = 100 * df1.groupby(['county', 'candidate'])['votes'].sum() / df1.groupby('county')['votes'].sum()     # calculate percentage for each candidate

df3a = df2.reset_index(name='percentage')   # get the percentage

df3a.loc[df3a.candidate == "Trump"].nlargest(100, 'percentage')       # Finally, filter by candidate and get the top 100 counties with highest percentages for the candidate
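One subtlety worth noting: the division above works because recent pandas versions align the (county, candidate) MultiIndex with the county-only index on their shared 'county' level. A small end-to-end sanity check on made-up numbers (only the column layout comes from the question):

```python
import pandas as pd

# Toy frame mimicking the question's layout, including a "Total" row.
df = pd.DataFrame({
    "county":    ["Ada", "Ada", "Adams", "Adams", "Total"],
    "candidate": ["Trump", "Clinton", "Trump", "Clinton", ""],
    "votes":     [10000, 900, 12345, 5000, 28245],
})

df1 = df.loc[df.county != "Total"]                 # exclude the Total row
df2 = (100 * df1.groupby(["county", "candidate"])["votes"].sum()
           / df1.groupby("county")["votes"].sum()) # aligns on the 'county' level

df3a = df2.reset_index(name="percentage")
top = df3a.loc[df3a.candidate == "Trump"].nlargest(100, "percentage")

print(top)  # Ada first (100 * 10000 / 10900 ≈ 91.74), then Adams
```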