Sort categorical values within groupby in pandas
I have this example df:
df3 = pd.DataFrame({'Customer':['Sara','John','Didi','Sara','Didi' ,'Didi'],
'Date': ['15-12-2021', '1-1-2022' , '1-3-2022','15-3-2022', '1-1-2022' , '1-4-2022'],
'Month': ['December-2021', 'January-2022', 'March-2022','March-2022', 'January-2022', 'April-2022'],
'Product': ['grocery','electronics','personal-care','grocery','electronics','personal-care'],
'status': ['purchased', 'refunded', 'refunded','refunded', 'purchased', 'refunded']
})
df3
which gives:
Customer Date Month Product status
0 Sara 15-12-2021 December-2021 grocery purchased
1 John 1-1-2022 January-2022 electronics refunded
2 Didi 1-3-2022 March-2022 personal-care refunded
3 Sara 15-3-2022 March-2022 grocery refunded
4 Didi 1-1-2022 January-2022 electronics purchased
5 Didi 1-4-2022 April-2022 personal-care refunded
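I am trying to group by customer, product and month and take the first status, and I want the grouped result to be ordered by the Month column: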
df3.sort_values('Month').groupby(['Customer','Product','Month','Date']).agg({'status':'first'}).reset_index()
I get:
Customer Product Month Date status
0 Didi electronics January-2022 1-1-2022 purchased
1 Didi personal-care April-2022 1-4-2022 refunded
2 Didi personal-care March-2022 1-3-2022 refunded
3 John electronics January-2022 1-1-2022 refunded
4 Sara grocery December-2021 15-12-2021 purchased
5 Sara grocery March-2022 15-3-2022 refunded
I expected rows 1 and 2 to be swapped, since March comes before April, so I tried defining:
months = {'December-2021':0,'January-2022':1,'February-2022':2,'March-2022':3,'April-2022':4,'May-2022':5,'June-2022':6,'July-2022':7,'August-2022':8,'September-2022':9,'October-2022':10,'November-2022':11}
and then mapping it through the key of sort_values:
df3.sort_values(by=['Month'], key=lambda x: x.map(months)).groupby(['Customer','Product','Month','Date']).agg({'status':'first'}).reset_index()
But I get exactly the same result, still in the wrong order.
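The problem is that the column is being sorted as strings, and alphabetically April comes before March. You have to convert the strings to dates first and then sort the entries. For example: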
# Convert column Month to datetime
df3['Month'] = pd.to_datetime(df3['Month'], format='%B-%Y')
# Do your groupby
df_group = df3.sort_values('Month').groupby(['Customer','Product','Month','Date'], sort=False).first().reset_index()
# Convert column Month back to string
df_group['Month'] = df_group['Month'].dt.strftime('%B-%Y')
df_group
Output:
Customer Product Month Date status
0 Sara grocery December-2021 15-12-2021 purchased
1 Didi electronics January-2022 1-1-2022 purchased
2 John electronics January-2022 1-1-2022 refunded
3 Didi personal-care March-2022 1-3-2022 refunded
4 Sara grocery March-2022 15-3-2022 refunded
5 Didi personal-care April-2022 1-4-2022 refunded
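To see the string-versus-date difference in isolation, here is a minimal sketch. The Series named months is made up for illustration and is not the question's DataFrame; it also assumes an English locale so that %B matches the month names.

```python
import pandas as pd

# Illustrative Series (hypothetical, not taken from df3)
months = pd.Series(['March-2022', 'April-2022', 'December-2021', 'January-2022'])

# Lexicographic sort compares characters, so 'April' sorts before 'March'
as_strings = months.sort_values().tolist()

# Chronological sort: parse to datetime first, sort, then format back
parsed = pd.to_datetime(months, format='%B-%Y')
as_dates = parsed.sort_values().dt.strftime('%B-%Y').tolist()

print(as_strings)
print(as_dates)
```

The first list is in alphabetical order, the second in calendar order.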
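You are currently sorting by string, so April comes before March.
You need to convert to datetime for sorting; here a custom year-month key (YYYYMM) is built from the Date column.
In addition, groupby sorts the groups by default, so you need to add sort=False to prevent the result from being reordered after aggregation.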
# %m is the zero-padded month; %M would be minutes, which are always 00 here
(df3.assign(key=pd.to_datetime(df3['Date'], dayfirst=True).dt.strftime('%Y%m'))
    .sort_values(by='key', kind='stable')  # stable sort keeps the original row order within a month
    .groupby(['Customer','Product','Month','Date'], sort=False)
    .agg({'status':'first'}).reset_index()
)
Output:
Customer Product Month Date status
0 Sara grocery December-2021 15-12-2021 purchased
1 John electronics January-2022 1-1-2022 refunded
2 Didi electronics January-2022 1-1-2022 purchased
3 Didi personal-care March-2022 1-3-2022 refunded
4 Sara grocery March-2022 15-3-2022 refunded
5 Didi personal-care April-2022 1-4-2022 refunded
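The effect of sort=False can be checked on its own with a small sketch. The two-row frame below is a toy example, not the question's data; it only shows that groupby reorders string keys alphabetically by default and keeps first-appearance order when sort=False is passed.

```python
import pandas as pd

# Toy frame (hypothetical), already in chronological row order
df = pd.DataFrame({
    'Month': ['March-2022', 'April-2022'],
    'status': ['refunded', 'refunded'],
})

# Default sort=True: groups come back ordered by key, so 'April-2022' is first
default_order = df.groupby('Month').first().index.tolist()

# sort=False: groups keep their order of first appearance in the input
kept_order = df.groupby('Month', sort=False).first().index.tolist()

print(default_order)
print(kept_order)
```

This is why sorting the rows beforehand has no visible effect unless sort=False is also given.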
You may need to pass sort=False to groupby:
df3.sort_values(by=['Month'], key=lambda x: x.map(months)).groupby(['Customer','Product','Month','Date'],sort=False).agg({'status':'first'}).reset_index()
Out[546]:
Customer Product Month Date status
0 Sara grocery December-2021 15-12-2021 purchased
1 John electronics January-2022 1-1-2022 refunded
2 Didi electronics January-2022 1-1-2022 purchased
3 Didi personal-care March-2022 1-3-2022 refunded
4 Sara grocery March-2022 15-3-2022 refunded
5 Didi personal-care April-2022 1-4-2022 refunded
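# Parse Month with an explicit format (infer_datetime_format is deprecated in pandas 2.x)
df3['Month'] = pd.to_datetime(df3['Month'], format='%B-%Y')
# groupby sorts the group keys by default, so Month is now ordered chronologically within each Customer/Product
df3 = df3.sort_values(by=["Month"],ascending=False).groupby(
    ['Customer','Product','Month','Date']).agg({
    'status':'first'}).reset_index()
# Convert Month back to its original string form
df3['Month'] = df3['Month'].dt.strftime('%B-%Y')
df3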
Your desired output:
Customer Product Month Date status
0 Didi electronics January-2022 1-1-2022 purchased
1 Didi personal-care March-2022 1-3-2022 refunded
2 Didi personal-care April-2022 1-4-2022 refunded
3 John electronics January-2022 1-1-2022 refunded
4 Sara grocery December-2021 15-12-2021 purchased
5 Sara grocery March-2022 15-3-2022 refunded
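Since the title asks about categorical values, another option is to make Month an ordered categorical so that groupby's default sorting follows calendar order without converting back and forth. This is only a sketch under assumptions: the name month_order is introduced here for illustration, the category order is derived from the data itself, and an English locale is assumed for %B.

```python
import pandas as pd

df3 = pd.DataFrame({
    'Customer': ['Sara', 'John', 'Didi', 'Sara', 'Didi', 'Didi'],
    'Date': ['15-12-2021', '1-1-2022', '1-3-2022', '15-3-2022', '1-1-2022', '1-4-2022'],
    'Month': ['December-2021', 'January-2022', 'March-2022', 'March-2022', 'January-2022', 'April-2022'],
    'Product': ['grocery', 'electronics', 'personal-care', 'grocery', 'electronics', 'personal-care'],
    'status': ['purchased', 'refunded', 'refunded', 'refunded', 'purchased', 'refunded'],
})

# Derive the chronological order of the month labels present in the data
month_order = (pd.to_datetime(df3['Month'].drop_duplicates(), format='%B-%Y')
                 .sort_values()
                 .dt.strftime('%B-%Y')
                 .tolist())

# An ordered categorical sorts by category position, not alphabetically
df3['Month'] = pd.Categorical(df3['Month'], categories=month_order, ordered=True)

# observed=True keeps only the key combinations that actually occur
out = (df3.groupby(['Customer', 'Product', 'Month', 'Date'], observed=True)
          .agg({'status': 'first'})
          .reset_index())
print(out)
```

With this approach the groups stay sorted by Customer and Product first, and by calendar month within them.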