How to keep the earliest record for each category without considering the extra columns?

Question

Suppose I have a data table with 3 columns:

Category             Color              Date
triangle             red                2017-10-10
square               yellow             2017-11-10
triangle             blue               2017-02-10
circle               yellow             2017-07-10
circle               red                2017-09-10

I want to find the earliest date for each category, so my desired output is:

Category             Color              Date
square               yellow             2017-11-10
triangle             blue               2017-02-10
circle               yellow             2017-07-10

I have looked through several posts on how to do this:

Finding the min date in a Pandas DF row and create new Column

and a few more.

One popular approach is the groupby method:

df.groupby('Category').first().reset_index()

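But if I use this method, it groups by Category, yet it keeps both records for triangle because it has two different colors.

Is there a better and more efficient way to do this?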

Answer 1

You can use sort_values + drop_duplicates:

df.sort_values(['Date']).drop_duplicates('Category', keep='first')

   Category   Color        Date
2  triangle    blue  2017-02-10
3    circle  yellow  2017-07-10
1    square  yellow  2017-11-10
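
For anyone who wants to try this end to end, here is a minimal self-contained sketch. It rebuilds the question's table by hand and, as an assumption not stated in the question, converts Date with pd.to_datetime so that sorting is chronological regardless of how the strings are formatted:

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Category": ["triangle", "square", "triangle", "circle", "circle"],
    "Color": ["red", "yellow", "blue", "yellow", "red"],
    "Date": ["2017-10-10", "2017-11-10", "2017-02-10", "2017-07-10", "2017-09-10"],
})

# Assumption: parse the strings into datetimes before sorting
df["Date"] = pd.to_datetime(df["Date"])

# Sort by date, then keep only the first (earliest) row of each Category
earliest = df.sort_values("Date").drop_duplicates("Category", keep="first")
print(earliest)
```

The rows come back ordered by date and keep their original index labels; Color is carried along but never takes part in the deduplication.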

If you want to preserve the original order of Category, you need to sort inside a groupby call:

df.groupby('Category', group_keys=False, sort=False)\
  .apply(lambda x: x.sort_values('Date'))\
  .drop_duplicates('Category', keep='first')

   Category   Color        Date
2  triangle    blue  2017-02-10
1    square  yellow  2017-11-10
3    circle  yellow  2017-07-10
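
As a possible alternative to the apply-based call (pandas 2.2 emits a DeprecationWarning when apply operates on the grouping column), the same first-appearance ordering can be sketched with idxmin and sort=False. This is not the answerer's code; it assumes Date has been converted to datetime and that the index labels are unique:

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Category": ["triangle", "square", "triangle", "circle", "circle"],
    "Color": ["red", "yellow", "blue", "yellow", "red"],
    "Date": pd.to_datetime(
        ["2017-10-10", "2017-11-10", "2017-02-10", "2017-07-10", "2017-09-10"]
    ),
})

# sort=False keeps the groups in order of first appearance;
# idxmin returns the index label of the earliest Date in each group
earliest_labels = df.groupby("Category", sort=False)["Date"].idxmin()

# Select the full rows for those labels
ordered = df.loc[earliest_labels]
print(ordered)
```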

Answer 2

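The following should give you the desired output. Compared with what you posted, I first sort the values by date, since you want to keep the earliest date for each category: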

df.sort_values('Date').groupby('Category').first().reset_index()

This gives the desired output:

   Category   Color        Date
0    circle  yellow  2017-07-10
1    square  yellow  2017-11-10
2  triangle    blue  2017-02-10

Edit

Thanks to @Wen in the comments, this call can also be made more efficient as follows:

df.sort_values('Date').groupby('Category', as_index=False).first()

This also gives:

   Category   Color        Date
0    circle  yellow  2017-07-10
1    square  yellow  2017-11-10
2  triangle    blue  2017-02-10
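
One caveat worth checking on your own data: GroupBy.first returns the first non-null value of each column within a group, so if the earliest row has a missing Color, the result can combine values from different rows. A minimal sketch, using a made-up two-row frame (not the question's data), compares it with head(1), which returns whole rows:

```python
import pandas as pd

# Hypothetical data: the earliest triangle row has no Color
df = pd.DataFrame({
    "Category": ["triangle", "triangle"],
    "Color": [None, "red"],
    "Date": pd.to_datetime(["2017-02-10", "2017-10-10"]),
})

sorted_df = df.sort_values("Date")

# first() picks the first non-null value column by column
by_first = sorted_df.groupby("Category", as_index=False).first()

# head(1) keeps the earliest row intact, missing Color included
by_head = sorted_df.groupby("Category").head(1)

print(by_first)
print(by_head)
```

With no missing values, as in the question's table, both calls select the same records.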

Answer 3

head will return your original columns:

df.sort_values(['Date']).groupby('Category').head(1)
Out[156]: 
   Category   Color        Date
2  triangle    blue  2017-02-10
3    circle  yellow  2017-07-10
1    square  yellow  2017-11-10

There is also nth:

df.sort_values(['Date']).groupby('Category',as_index=False).nth(0)
Out[158]: 
   Category   Color        Date
2  triangle    blue  2017-02-10
3    circle  yellow  2017-07-10
1    square  yellow  2017-11-10

Or idxmin:

df.loc[df.groupby('Category').Date.idxmin()]
Out[166]: 
   Category   Color       Date
3    circle  yellow 2017-07-10
1    square  yellow 2017-11-10
2  triangle    blue 2017-02-10
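
As a quick sanity check, the sketch below rebuilds the question's table and confirms that head(1) and idxmin pick the same rows, differing only in row order. The pd.to_datetime conversion is an assumption, added so that idxmin compares real dates. With tied dates inside a category the two approaches are not guaranteed to pick the same row:

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Category": ["triangle", "square", "triangle", "circle", "circle"],
    "Color": ["red", "yellow", "blue", "yellow", "red"],
    "Date": pd.to_datetime(
        ["2017-10-10", "2017-11-10", "2017-02-10", "2017-07-10", "2017-09-10"]
    ),
})

# Approach 1: sort by date, then take the first row of each group
via_head = df.sort_values("Date").groupby("Category").head(1)

# Approach 2: look up the index label of the minimum Date per group
via_idxmin = df.loc[df.groupby("Category")["Date"].idxmin()]

# Align both results on the original index before comparing
same_rows = via_head.sort_index().equals(via_idxmin.sort_index())
print(same_rows)
```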

python

pandas

pandas-groupby