Aggregating a pandas data frame and deleting non-required rows

I have a data frame on which I want to perform an aggregation and delete some unneeded rows based on a specific condition.

ID  Type                     Band      Event            Date        Function       Title                Country 
1   Lead  Jr                   L       Hire             07/06/2016  PM          Lead Product Specialist India 
1   Lead  Jr                   L       Job Change       01/03/2019  PM          Lead Product Specialist India
1   Lead  Jr                   L       Job Change       01/03/2019  PM          Lead Product Specialist India
1   Lead  Sr                   S       Promotion        25/07/2019  PM          Lead Project Manager    India
2   Trainee                    P       Job Change       25/07/2016  AM          Trainee                 Australia
2   SW Developer               L       Promotion        25/07/2017  AM          Developer Lead          Australia
2   SW Developer               L       Job Change       25/07/2018  AM          Developer Lead          Australia
2   Lead  Specialist           S       Promotion        25/07/2019  AM          Lead Project Manager    Australia
3   Lead  Specialist           S       Promotion        25/10/2019  AM          Lead Project Manager    Australia
4   Sr  Specialist             S       Promotion        25/11/2019  AM          Lead Project Manager    Australia

I want the following output from the data:

ID  Type                Band       Event            Date        Function       Title               Country 
1   Lead  Jr             L         Job Change    01/03/2019     PM       Lead Product Specialist     India
1   Lead  Sr             S         Promotion     25/07/2019     PM       Lead Project Manager        India
2   Trainee              P         Job Change    25/07/2016     AM       Trainee                   Australia
2   SW Developer         L         Job Change    25/07/2018     AM       Developer Lead            Australia
2   Lead  Specialist     S         Promotion     25/07/2019     AM       Lead Project Manager      Australia
3   Lead  Specialist     S         Promotion     25/10/2019     AM       Lead Project Manager      Australia
4   Sr  Specialist       S         Promotion     25/11/2019     AM       Lead Project Manager      Australia 

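So basically the logic is that I need to group at the Type and Band level and keep the record with the latest date, i.e. the most recent record. For example, if there are three records with Band = "L" and Type = "Lead Jr" on three different dates, I need the record with the latest of those three dates, and so on.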

If you sort the data frame by date in descending order, the rows within each group are sorted the same way, so you can safely take the first one.

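# parse Date so the sort is chronological; ID is in the keys so each employee is grouped separately
df.assign(Date=pd.to_datetime(df["Date"], format="%d/%m/%Y")).sort_values("Date", ascending=False).groupby(["ID", "Type", "Band"]).first()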
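As a minimal runnable sketch of the sort-then-take-first idea, the snippet below rebuilds a few of the question's rows and applies it. The frame construction, the explicit `format="%d/%m/%Y"`, the `as_index=False` argument, and grouping by `ID` in addition to `Type` and `Band` are assumptions made for illustration; the expected output in the question keeps ID 2 and ID 3 apart, which suggests `ID` belongs in the keys.

```python
import pandas as pd

# A subset of the question's rows (only the columns needed here;
# the doubled spaces in Type are normalized to single spaces).
df = pd.DataFrame({
    "ID":    [1, 1, 1, 1, 2, 3],
    "Type":  ["Lead Jr", "Lead Jr", "Lead Jr", "Lead Sr",
              "Lead Specialist", "Lead Specialist"],
    "Band":  ["L", "L", "L", "S", "S", "S"],
    "Event": ["Hire", "Job Change", "Job Change", "Promotion",
              "Promotion", "Promotion"],
    "Date":  ["07/06/2016", "01/03/2019", "01/03/2019", "25/07/2019",
              "25/07/2019", "25/10/2019"],
})

# Parse the day-first strings so that sorting is chronological, not lexical.
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")

# Newest rows first, then take the first row of every (ID, Type, Band) group.
latest = (
    df.sort_values("Date", ascending=False)
      .groupby(["ID", "Type", "Band"], as_index=False)
      .first()
)
print(latest)
```

Note that `GroupBy.first()` returns the first non-null value of each column, so if some columns can contain missing values, `.head(1)` may be the safer choice because it returns whole rows unchanged.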
# convert Date to datetime; the strings are day-first (e.g. 25/07/2019), so give the format explicitly
df.Date = pd.to_datetime(df.Date, format="%d/%m/%Y")

# depending on the data, optionally sort
df.sort_values(['ID', 'Type', 'Date'], inplace=True)

# drop_duplicates with keep='last'
df.drop_duplicates(['ID', 'Type', 'Band'], keep='last')  # optionally add .reset_index(drop=True)

Sorting and drop_duplicates as a one-liner:

df.sort_values(['ID', 'Type', 'Date']).drop_duplicates(['ID', 'Type', 'Band'], keep='last')

Result:

   ID              Type Band       Event       Date Function                    Title   Country
2   1          Lead  Jr    L  Job Change 2019-03-01       PM  Lead Product Specialist      India
3   1          Lead  Sr    S   Promotion 2019-07-25       PM     Lead Project Manager      India
7   2  Lead  Specialist    S   Promotion 2019-07-25       AM     Lead Project Manager  Australia
6   2      SW Developer    L  Job Change 2018-07-25       AM           Developer Lead  Australia
4   2           Trainee    P  Job Change 2016-07-25       AM                  Trainee  Australia
8   3  Lead  Specialist    S   Promotion 2019-10-25       AM     Lead Project Manager  Australia
9   4    Sr  Specialist    S   Promotion 2019-11-25       AM     Lead Project Manager  Australia
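
For anyone who wants to check the `drop_duplicates` approach end to end, here is a self-contained sketch on the question's ten rows, reduced to the columns that matter. The frame construction and the explicit `format="%d/%m/%Y"` are assumptions for illustration. The last step is an alternative that is not part of the answers above: it selects the row holding the maximum `Date` of each group with `idxmax`, which avoids sorting the whole frame.

```python
import pandas as pd

# The question's ten rows, reduced to the relevant columns
# (doubled spaces in Type are normalized to single spaces).
df = pd.DataFrame({
    "ID":    [1, 1, 1, 1, 2, 2, 2, 2, 3, 4],
    "Type":  ["Lead Jr", "Lead Jr", "Lead Jr", "Lead Sr", "Trainee",
              "SW Developer", "SW Developer", "Lead Specialist",
              "Lead Specialist", "Sr Specialist"],
    "Band":  ["L", "L", "L", "S", "P", "L", "L", "S", "S", "S"],
    "Event": ["Hire", "Job Change", "Job Change", "Promotion", "Job Change",
              "Promotion", "Job Change", "Promotion", "Promotion", "Promotion"],
    "Date":  ["07/06/2016", "01/03/2019", "01/03/2019", "25/07/2019",
              "25/07/2016", "25/07/2017", "25/07/2018", "25/07/2019",
              "25/10/2019", "25/11/2019"],
})
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")

keys = ["ID", "Type", "Band"]

# Approach from the answer: sort ascending, keep the last row per key.
result = (
    df.sort_values(["ID", "Type", "Date"])
      .drop_duplicates(keys, keep="last")
      .reset_index(drop=True)
)

# Alternative (not from the answers): pick the row label of the
# maximum Date within each group, then select those rows.
alt = df.loc[df.groupby(keys)["Date"].idxmax()].reset_index(drop=True)

print(result)
```

Both approaches keep one row per `(ID, Type, Band)` combination. When a group has two rows with the same latest date, `keep='last'` and `idxmax` may pick different rows, which only matters if those rows differ in other columns.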