Retain the duplicate row based on priority given in the list
I have a DataFrame:
df = pd.DataFrame([["A","Q",98,56],["C","S",18,45], ["B","T",79,54], ["A","P",98,56],["C","R",18,45],["B","S",79,54], ["A","R",84,65],["B","Q",79,54],["C","Q",19,44]], columns=["id","prio","c1","c2"])
I have a list:
Priority = ["P","R","Q","S","T"]
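Select the duplicate rows based on id, c1 and c2.
If duplicate rows are found, keep only the row whose prio value ranks highest according to the order given in the list.
Example: among the duplicate rows for id A, the prio column contains P and Q, so P takes priority and the other row is dropped. Likewise, among the duplicate rows for id B, the prio column contains T, S and Q; of these, Q comes first in the list, so the Q row is kept.
Expected output: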
df_out = pd.DataFrame([["A","P",98,56],["C","R",18,45], ["A","R",84,65],["B","Q",79,54],["C","Q",19,44]], columns=["id","prio","c1","c2"])
How can I do this?
You can convert the values to an ordered categorical, then use DataFrame.sort_values with DataFrame.drop_duplicates:
df['prio'] = pd.Categorical(df['prio'], categories=Priority, ordered=True)
df = df.sort_values('prio').drop_duplicates(['id','c1','c2'])
print(df)
  id prio  c1  c2
3  A    P  98  56
4  C    R  18  45
6  A    R  84  65
7  B    Q  79  54
8  C    Q  19  44
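If you would rather leave the prio column as plain strings, a rough alternative is sketched below. It maps each prio value to its position in Priority, sorts on that temporary rank, and drops duplicates. The `rank` dict and the `_rank` column are names made up for this illustration, and the final `sort_index()` is an optional extra that restores the original row order.

```python
import pandas as pd

df = pd.DataFrame(
    [["A", "Q", 98, 56], ["C", "S", 18, 45], ["B", "T", 79, 54],
     ["A", "P", 98, 56], ["C", "R", 18, 45], ["B", "S", 79, 54],
     ["A", "R", 84, 65], ["B", "Q", 79, 54], ["C", "Q", 19, 44]],
    columns=["id", "prio", "c1", "c2"],
)
Priority = ["P", "R", "Q", "S", "T"]

# Hypothetical helper: position of each prio value in Priority (lower = higher priority)
rank = {p: i for i, p in enumerate(Priority)}

out = (
    df.assign(_rank=df["prio"].map(rank))      # temporary numeric rank column
      .sort_values("_rank", kind="mergesort")  # stable sort, best priority first
      .drop_duplicates(["id", "c1", "c2"])     # keep the first (best) row per group
      .drop(columns="_rank")                   # remove the helper column
      .sort_index()                            # optional: back to original row order
)
print(out)
```

Any prio value that is missing from Priority maps to NaN here and sorts last, so it is kept only when no listed value exists in the same group.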