A way to make the [apply + lambda + loc] efficient

The data frame loaded from a .csv into pandas (containing 2016 election data) has the following structure:

In [2]: df
Out[2]: 
   county  candidate  votes  ...
0  Ada     Trump      10000  ...
1  Ada     Clinton    900    ... 
2  Adams   Trump      12345  ...
.
.
n  Total   ...        ...    ...

The idea is to find the top X counties with the highest percentage of votes for candidate X (dropping the Total row).

For example, suppose we want 100 counties and the candidate is Trump; the operation to perform is: 100 * (votes for Trump in the county) / (total votes in the county).

I have implemented the following code, which gives the correct result:

In [3]: (df.groupby(by="county")
           .apply(lambda x: 100 * x.loc[(x.candidate == "Trump") 
                  & (x.county != "Total"), "votes"].sum() / x.votes.sum())
           .nlargest(100) 
           .reset_index(name='percentage'))
Out[3]: 
   county   percentage
0  Hayes    91.82
1  WALLACE  90.35
2  Arthur   89.37
.
.
99 GRANT    79.10

Using %%time, I found that it is slow:

Out[3]: 
CPU times: user 964 ms, sys: 24 ms, total: 988 ms
Wall time: 943 ms

Is there a way to make it faster?

You could try:

  • Assuming you don't have a 'Total' row containing the sum of all votes:
(100 * df.loc[df['candidate'] == 'Trump'].groupby('county')['votes'].sum() / df.groupby('county')['votes'].sum()).nlargest(100)
  • Assuming you have a 'Total' row containing the sum of all votes, drop it first:
df1 = df.loc[df['county'] != 'Total']
(100 * df1.loc[df1['candidate'] == 'Trump'].groupby('county')['votes'].sum() / df1.groupby('county')['votes'].sum()).nlargest(100)

I can't test it because I don't have the data, but it avoids apply entirely, which should improve performance. (Note that the denominator has to be the per-county vote total, not the grand total, to match your percentages.)

To get a named column, you can append .reset_index(name='percentage') at the end.
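Since the answer above is untested, here is a quick check on made-up data (only the column names and the candidate filter come from the question; the numbers are invented) that an apply-free filter + groupby computation reproduces the original per-county percentages:

```python
import pandas as pd

# Made-up numbers with the question's column layout.
df = pd.DataFrame({
    "county":    ["Ada", "Ada", "Adams", "Adams"],
    "candidate": ["Trump", "Clinton", "Trump", "Clinton"],
    "votes":     [10000, 900, 12345, 5000],
})

# Vectorized: Trump votes per county divided by all votes per county.
vec = (100 * df.loc[df.candidate == "Trump"].groupby("county")["votes"].sum()
           / df.groupby("county")["votes"].sum()).nlargest(100)

# The question's apply-based version, for comparison.
ref = (df.groupby("county")
         .apply(lambda x: 100 * x.loc[x.candidate == "Trump", "votes"].sum()
                / x.votes.sum())
         .nlargest(100))

print((vec - ref).abs().max() < 1e-9)  # True: both give the same percentages
```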

You can try modifying your code to use only vectorized operations to speed up the process, as follows:

df1 = df.loc[(df.county != "Total")]     # exclude the Total row(s)

df2 = 100 * df1.groupby(['county', 'candidate'])['votes'].sum() / df1.groupby('county')['votes'].sum()     # calculate percentage for each candidate

df3 = df2.nlargest(100).reset_index(name='percentage')   # get the largest 100

df3.loc[df3.candidate == "Trump"]        # Finally, filter by candidate

Edit:

If you want the top 100 counties with the highest percentages, you can slightly change the code, as below:

df1 = df.loc[(df.county != "Total")]     # exclude the Total row(s)

df2 = 100 * df1.groupby(['county', 'candidate'])['votes'].sum() / df1.groupby('county')['votes'].sum()     # calculate percentage for each candidate

df3a = df2.reset_index(name='percentage')   # get the percentage

df3a.loc[df3a.candidate == "Trump"].nlargest(100, 'percentage')       # Finally, filter by candidate and get the top 100 counties with highest percentages for the candidate
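One subtlety worth noting: the division above works because recent pandas versions align the (county, candidate) MultiIndex with the county-only index on their shared 'county' level. A small end-to-end sanity check on made-up numbers (only the column layout comes from the question):

```python
import pandas as pd

# Toy frame mimicking the question's layout, including a "Total" row.
df = pd.DataFrame({
    "county":    ["Ada", "Ada", "Adams", "Adams", "Total"],
    "candidate": ["Trump", "Clinton", "Trump", "Clinton", ""],
    "votes":     [10000, 900, 12345, 5000, 28245],
})

df1 = df.loc[df.county != "Total"]                 # exclude the Total row
df2 = (100 * df1.groupby(["county", "candidate"])["votes"].sum()
           / df1.groupby("county")["votes"].sum()) # aligns on the 'county' level

df3a = df2.reset_index(name="percentage")
top = df3a.loc[df3a.candidate == "Trump"].nlargest(100, "percentage")

print(top)  # Ada first (100 * 10000 / 10900 ≈ 91.74), then Adams
```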