Retain the duplicate row based on the priority given in a list

I have a DataFrame:

import pandas as pd

df = pd.DataFrame(
    [["A","Q",98,56], ["C","S",18,45], ["B","T",79,54],
     ["A","P",98,56], ["C","R",18,45], ["B","S",79,54],
     ["A","R",84,65], ["B","Q",79,54], ["C","Q",19,44]],
    columns=["id","prio","c1","c2"])

And I have a list:

Priority = ["P","R","Q","S","T"]

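I want to find duplicate rows based on id, c1 and c2. When duplicates are found, only one row from each group should be kept: the one whose prio value comes earliest in the Priority list.

Example: among the duplicate rows with id A (c1 = 98, c2 = 56), prio contains P and Q, so P takes priority and the other row is dropped. Similarly, among the duplicate rows with id B, prio contains T, S and Q; of these, Q comes first in the list, so the Q row is kept.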

Expected output:

df_out = pd.DataFrame([["A","P",98,56],["C","R",18,45], ["A","R",84,65],["B","Q",79,54],["C","Q",19,44]], columns=["id","prio","c1","c2"]) 

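How can I do this?

You can convert the values to an ordered categorical, so that sorting follows the order of the Priority list, and then use DataFrame.sort_values with DataFrame.drop_duplicates, which keeps the first row of each duplicate group by default: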

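# Make prio an ordered categorical so it sorts in the order given by Priority
df['prio'] = pd.Categorical(df['prio'], categories=Priority, ordered=True)
# Sort by priority, then keep the first (highest-priority) row per (id, c1, c2) group
df = df.sort_values('prio').drop_duplicates(['id','c1','c2'])
print(df)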
  id prio  c1  c2
3  A    P  98  56
4  C    R  18  45
6  A    R  84  65
7  B    Q  79  54
8  C    Q  19  44
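
If you would rather leave the prio column as plain strings, a possible variation (a sketch, not part of the approach above) is to sort through a rank lookup instead of converting the column. It assumes pandas 1.1 or later for the `key` argument of sort_values; the names `rank` and `result` are illustrative only. Passing `kind="stable"` makes the order among rows with equal priority explicit rather than relying on the default sort.

```python
import pandas as pd

df = pd.DataFrame(
    [["A", "Q", 98, 56], ["C", "S", 18, 45], ["B", "T", 79, 54],
     ["A", "P", 98, 56], ["C", "R", 18, 45], ["B", "S", 79, 54],
     ["A", "R", 84, 65], ["B", "Q", 79, 54], ["C", "Q", 19, 44]],
    columns=["id", "prio", "c1", "c2"],
)
Priority = ["P", "R", "Q", "S", "T"]

# Hypothetical helper: position of each label in Priority (lower = higher priority)
rank = {label: pos for pos, label in enumerate(Priority)}

# Sort by rank without modifying the prio column, then keep the first
# row of each (id, c1, c2) group, i.e. the one with the best priority
result = (
    df.sort_values("prio", key=lambda s: s.map(rank), kind="stable")
      .drop_duplicates(["id", "c1", "c2"])
)
print(result)
```

Labels that do not appear in Priority map to NaN here and sort last, so they are kept only when no listed label exists in their group.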