Aggregating a pandas data frame and deleting unneeded rows
I have a data frame on which I want to perform an aggregation and drop some unneeded rows based on a specific condition:
ID Type Band Event Date Function Title Country
1 Lead Jr L Hire 07/06/2016 PM Lead Product Specialist India
1 Lead Jr L Job Change 01/03/2019 PM Lead Product Specialist India
1 Lead Jr L Job Change 01/03/2019 PM Lead Product Specialist India
1 Lead Sr S Promotion 25/07/2019 PM Lead Project Manager India
2 Trainee P Job Change 25/07/2016 AM Trainee Australia
2 SW Developer L Promotion 25/07/2017 AM Developer Lead Australia
2 SW Developer L Job Change 25/07/2018 AM Developer Lead Australia
2 Lead Specialist S Promotion 25/07/2019 AM Lead Project Manager Australia
3 Lead Specialist S Promotion 25/10/2019 AM Lead Project Manager Australia
4 Sr Specialist S Promotion 25/11/2019 AM Lead Project Manager Australia
I want the following output:
ID Type Band Event Date Function Title Country
1 Lead Jr L Job Change 01/03/2019 PM Lead Product Specialist India
1 Lead Sr S Promotion 25/07/2019 PM Lead Project Manager India
2 Trainee P Job Change 25/07/2016 AM Trainee Australia
2 SW Developer L Job Change 25/07/2018 AM Developer Lead Australia
2 Lead Specialist L Promotion 25/07/2019 AM Lead Project Manager Australia
3 Lead Specialist S Promotion 25/10/2019 AM Lead Project Manager Australia
4 Sr Specialist S Promotion 25/11/2019 AM Lead Project Manager Australia
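So the logic is that I need to group at the Type and Band level and keep the record with the most recent date. For example, if there are three records with Band = "L" and Type = "Lead Jr" and three different dates, I need to keep the most recent of the three, and so on.
If you sort the data frame by date in descending order, the rows within each group keep that order, so you can safely take the first one. This assumes Date has already been converted to datetime; as dd/mm/yyyy strings the sort would be lexicographic rather than chronological.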
df.sort_values("Date", ascending=False).groupby(["Type", "Band"]).first()
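The sort-then-take-first idea can be sketched end to end as follows. This is only an illustration under a few assumptions: the sample frame is rebuilt by hand from the question (with just the columns that matter), ID is added to the grouping keys because the desired output keeps separate rows for IDs 2 and 3 with the same Type, and head(1) is used instead of first() so that whole rows are returned with the keys left as ordinary columns.

```python
import pandas as pd

# Sample rows copied from the question (only the columns needed here).
df = pd.DataFrame(
    {
        "ID": [1, 1, 1, 1, 2, 2, 2, 2, 3, 4],
        "Type": ["Lead Jr", "Lead Jr", "Lead Jr", "Lead Sr", "Trainee",
                 "SW Developer", "SW Developer", "Lead Specialist",
                 "Lead Specialist", "Sr Specialist"],
        "Band": ["L", "L", "L", "S", "P", "L", "L", "S", "S", "S"],
        "Event": ["Hire", "Job Change", "Job Change", "Promotion", "Job Change",
                  "Promotion", "Job Change", "Promotion", "Promotion", "Promotion"],
        "Date": ["07/06/2016", "01/03/2019", "01/03/2019", "25/07/2019",
                 "25/07/2016", "25/07/2017", "25/07/2018", "25/07/2019",
                 "25/10/2019", "25/11/2019"],
    }
)

# The dates are dd/mm/yyyy, so parse them with an explicit format.
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")

latest = (
    df.sort_values("Date", ascending=False)        # newest rows first
      .groupby(["ID", "Type", "Band"], sort=False)
      .head(1)                                      # first (newest) row of each group
      .sort_values(["ID", "Date"])                  # restore a readable order
      .reset_index(drop=True)
)
print(latest)
```

Grouping on Type and Band alone, as in the one-liner above, would merge the "Lead Specialist" rows of IDs 2 and 3 into a single row, which is why ID is included here.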
import pandas as pd
# convert Date to datetime; the dates are dd/mm/yyyy, so state the format explicitly
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
# depending on the data, optionally sort
df.sort_values(['ID', 'Type', 'Date'], inplace=True)
# drop_duplicates with keep='last'
df.drop_duplicates(['ID', 'Type', 'Band'], keep='last') # optionally add .reset_index(drop=True)
Sort and drop_duplicates as a one-liner:
df.sort_values(['ID', 'Type', 'Date']).drop_duplicates(['ID', 'Type', 'Band'], keep='last')
Result (with day-first parsing, 01/03/2019 is read as 1 March 2019):
ID Type Band Event Date Function Title Country
2 1 Lead Jr L Job Change 2019-03-01 PM Lead Product Specialist India
3 1 Lead Sr S Promotion 2019-07-25 PM Lead Project Manager India
7 2 Lead Specialist S Promotion 2019-07-25 AM Lead Project Manager Australia
6 2 SW Developer L Job Change 2018-07-25 AM Developer Lead Australia
4 2 Trainee P Job Change 2016-07-25 AM Trainee Australia
8 3 Lead Specialist S Promotion 2019-10-25 AM Lead Project Manager Australia
9 4 Sr Specialist S Promotion 2019-11-25 AM Lead Project Manager Australia
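As a possible cross-check on the drop_duplicates approach, the same selection can be written without sorting, using groupby(...).idxmax() to look up the index label of the latest date in each group. This is a different technique from the one in the answer, shown on a reduced, hand-built sample; note that when two rows tie on the latest date, idxmax picks the first of them rather than the last.

```python
import pandas as pd

# Reduced sample, hand-built for illustration (not the full table from the question).
df = pd.DataFrame(
    {
        "ID": [1, 1, 1, 2, 2],
        "Type": ["Lead Jr", "Lead Jr", "Lead Sr", "SW Developer", "SW Developer"],
        "Band": ["L", "L", "S", "L", "L"],
        "Date": ["07/06/2016", "01/03/2019", "25/07/2019", "25/07/2017", "25/07/2018"],
    }
)
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")

# Index label of the row holding the maximum Date in each (ID, Type, Band) group.
idx = df.groupby(["ID", "Type", "Band"])["Date"].idxmax()

# Select those rows and keep the original row order.
latest = df.loc[idx].sort_index()
print(latest)
```

Because the original index labels are preserved, the result can be compared directly against the index column of the drop_duplicates output.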