Getting maximum counts of a column in grouped dataframe

Question

My dataframe df is:

    Election Year   Votes   Party   Region
  0   2000           50      A       a
  1   2000           100     B       a
  2   2000           26      A       b
  3   2000           180     B       b
  4   2000           300     A       c
  5   2000           46      C       c
  6   2005           149     A       a
  7   2005           46      B       a
  8   2005           312     A       b
  9   2005           23      B       b
  10  2005           16      A       c
  11  2005           35      C       c

For each year, I want to get the party that won the most regions. So the expected output is:

 Election Year Party
   2000         B
   2005         A

I tried this code to get the output above, but it raises an error:

 winner = df.groupby(['Election Year'])['Votes'].max().reset_index()
 winner = winner.groupby('Election Year').first().reset_index()
 winner = winner[['Election Year', 'Party']].to_string(index=False)
 winner

How can I get the desired output?

Answer 1

I believe the one-liner df.groupby(["Election Year"]).max().reset_index()[['Election Year', 'Party']] can solve your problem.
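
A quick way to check what a groupby followed by max() returns is to run it on the question's data. The sketch below rebuilds df from the table in the question (the construction code is an assumption, not part of the original post). Since max() is applied to each column independently, the Party value it returns is the alphabetically largest label in each year, which need not be the party that won the most regions.

```python
import pandas as pd

# Rebuild the question's dataframe (construction is illustrative).
df = pd.DataFrame({
    "Election Year": [2000] * 6 + [2005] * 6,
    "Votes": [50, 100, 26, 180, 300, 46, 149, 46, 312, 23, 16, 35],
    "Party": ["A", "B", "A", "B", "A", "C"] * 2,
    "Region": ["a", "a", "b", "b", "c", "c"] * 2,
})

# max() works column by column, so Votes, Party and Region are maximised separately.
per_year_max = df.groupby("Election Year").max().reset_index()[["Election Year", "Party"]]
print(per_year_max)
```

On this data the Party column comes out as C for both years, so the result should be compared against the expected output before relying on it.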

Answer 2

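Here is one approach using a nested groupby. We first sum the votes of each party within each year-region pair and take the party with the most votes, then use mode to find the party that won the most regions. The mode need not be unique (if two or more parties win the same number of regions). Note that the code uses the column name Year, so it assumes the Election Year column has been renamed accordingly.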

df.groupby(["Year", "Region"])\
  .apply(lambda gp: gp.groupby("Party").Votes.sum().idxmax())\
  .unstack().mode(1).rename(columns={0: "Party"})

     Party
Year      
2000     B
2005     A
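
To illustrate the remark that the mode need not be unique, here is a minimal sketch on a hypothetical table of regional winners (the winners frame and its region d are made up for illustration and are not from the question's data). When a year is tied, mode(axis=1) returns an extra column for the second mode and fills the other rows with NaN.

```python
import pandas as pd

# Hypothetical regional winners: 2000 has a clear winner (B),
# while 2005 is tied between A and B with two regions each.
winners = pd.DataFrame(
    {"a": ["B", "A"], "b": ["B", "B"], "c": ["A", "B"], "d": ["C", "A"]},
    index=pd.Index([2000, 2005], name="Year"),
)

# Row-wise mode: one column per tied value, padded with NaN where there is no tie.
modes = winners.mode(axis=1)
print(modes)
```

If ties are possible in real data, renaming only column 0 to Party silently keeps one of the tied parties, so it may be worth checking whether the result has more than one column.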

To address the comment, you can replace idxmax above with nlargest and diff to find the regions where the winning margin is below a given number.

margin = df.groupby(["Year", "Region"])\
  .apply(lambda gp: gp.groupby("Party").Votes.sum().nlargest(2).diff()) > -125

print(margin[margin].reset_index()[["Year", "Region"]])

#    Year Region
# 0  2000      a
# 1  2005      a
# 2  2005      c

Answer 3

Try this:

winner = df.groupby(['Election Year','Party'])['Votes'].max().reset_index()
winner.drop('Votes', axis = 1, inplace = True)
winner

Answer 4

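You can use GroupBy.idxmax() to get the index of the row with the maximum Votes within each Election Year group, then use .loc to locate those rows and select the desired columns, as follows: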

df.loc[df.groupby('Election Year')['Votes'].idxmax()][['Election Year', 'Party']]

Result:

   Election Year Party
4           2000     A
8           2005     A
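
The two steps described above can be sketched separately, which makes the intermediate index labels visible. The dataframe construction is an assumption that mirrors the table in the question. Note that this selects the single row with the most votes in each year, which is not necessarily the party that won the most regions.

```python
import pandas as pd

# Rebuild the question's dataframe (construction is illustrative).
df = pd.DataFrame({
    "Election Year": [2000] * 6 + [2005] * 6,
    "Votes": [50, 100, 26, 180, 300, 46, 149, 46, 312, 23, 16, 35],
    "Party": ["A", "B", "A", "B", "A", "C"] * 2,
    "Region": ["a", "a", "b", "b", "c", "c"] * 2,
})

# Step 1: index label of the row with the highest Votes in each year.
idx = df.groupby("Election Year")["Votes"].idxmax()

# Step 2: pull those rows and keep only the columns of interest.
top_rows = df.loc[idx, ["Election Year", "Party"]]
print(top_rows)
```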

Edit

If we want the Party that won the most Regions, we can use the code below (without the slow .apply() with a lambda function):

(df.loc[
    df.groupby(['Election Year', 'Region'])['Votes'].idxmax()]
    [['Election Year', 'Party', 'Region']]
    .pivot(index='Election Year', columns='Region')
    .mode(axis=1)
).rename({0: 'Party'}, axis=1).reset_index()

Result:

   Election Year Party
0           2000     B
1           2005     A
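
As a variation on the same idea, the regions won by each party can be counted explicitly instead of taking a row-wise mode. This is a sketch of a different technique (groupby size plus unstack), not the answer's own code, and the names regional, regions_won and regions_table are illustrative. The dataframe construction again mirrors the question's table.

```python
import pandas as pd

# Rebuild the question's dataframe (construction is illustrative).
df = pd.DataFrame({
    "Election Year": [2000] * 6 + [2005] * 6,
    "Votes": [50, 100, 26, 180, 300, 46, 149, 46, 312, 23, 16, 35],
    "Party": ["A", "B", "A", "B", "A", "C"] * 2,
    "Region": ["a", "a", "b", "b", "c", "c"] * 2,
})

# Winning row of each (year, region) pair.
regional = df.loc[df.groupby(["Election Year", "Region"])["Votes"].idxmax()]

# Number of regions won by each party in each year, as a year-by-party table.
regions_won = regional.groupby(["Election Year", "Party"]).size()
regions_table = regions_won.unstack(fill_value=0)

# Party with the most regions per year; on a tie idxmax keeps the first column.
most_regions = regions_table.idxmax(axis=1)
print(most_regions)
```

Keeping regions_table around also makes ties easy to inspect, since the counts for every party are visible side by side.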

Answer 5

Another way (actually close to @hilberts_drinking_problem's answer):

>>> df.groupby(["Election Year", "Region"]) \
      .apply(lambda x: x.loc[x["Votes"].idxmax(), "Party"]) \
      .unstack().mode(axis="columns") \
      .rename(columns={0: "Party"}).reset_index()

   Election Year Party
0           2000     B
1           2005     A
