Pandas: Retain a particular row based on the weightage given to a particular column, if duplicates are present based on other columns after groupby

I have a DataFrame df:

df = pd.DataFrame([["A","X",98,56,61], ["B","E",79,54,36], ["A","Y",98,56,61],["B","F",79,54,36], ["A","Z",98,56,61], ["A","W",48,51,85],["B","G",44,57,86],["B","H",79,54,36]], columns=["id","class","c1","c2","c3"])

When we group by id, if there are duplicate rows based on several columns, say c1, c2, c3, I want to keep only the row selected by the weightage given to the class column.

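For example, when we group by id A, the rows with class X, Y, Z are duplicates on c1, c2, c3. Among X, Y, Z the weightage is given to X, so keep X and drop the other rows. Similarly, for id B the rows with class E, F, H are duplicates and the weightage is given to F, so keep F and drop the other rows.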

Expected output:

output = pd.DataFrame([["A","X",98,56,61],["B","F",79,54,36],["A","W",48,51,85],["B","G",44,57,86]], columns=["id","class","c1","c2","c3"])

How can this be done?

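Based on your explanation, you can create a list of the classes that carry the weightage, build two conditions (the row is duplicated on id, c1, c2, c3, and its class is in the list), and then filter: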

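# classes to keep in case of duplicates
cls = ['X', 'F']
# True for every row that has a duplicate on id, c1, c2, c3
c = df.duplicated(['id', 'c1', 'c2', 'c3'], keep=False)
# keep duplicated rows whose class is in cls, plus all non-duplicated rows
out = df[(c & df['class'].isin(cls)) | ~c]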

print(out)

  id class  c1  c2  c3
0  A     X  98  56  61
3  B     F  79  54  36
5  A     W  48  51  85
6  B     G  44  57  86
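
If the weightage needs to be more than a yes/no list, one possible variation is to map each class to a numeric weight and keep the highest-weighted row in each duplicate group. The sketch below is only an illustration: the `weights` dictionary, its values, and the helper column `_w` are assumptions and not part of the question. Classes missing from the dictionary are treated as weight 0.

```python
import pandas as pd

df = pd.DataFrame(
    [["A", "X", 98, 56, 61], ["B", "E", 79, 54, 36], ["A", "Y", 98, 56, 61],
     ["B", "F", 79, 54, 36], ["A", "Z", 98, 56, 61], ["A", "W", 48, 51, 85],
     ["B", "G", 44, 57, 86], ["B", "H", 79, 54, 36]],
    columns=["id", "class", "c1", "c2", "c3"],
)

# Hypothetical weights: a higher value means the class is preferred.
weights = {"X": 1, "F": 1}
keys = ["id", "c1", "c2", "c3"]

out2 = (
    df.assign(_w=df["class"].map(weights).fillna(0))    # weight per row, 0 if unlisted
      .sort_values("_w", ascending=False, kind="stable")  # preferred rows first
      .drop_duplicates(keys)                              # keep first row per duplicate group
      .sort_index()                                       # restore the original row order
      .drop(columns="_w")
)

print(out2)
```

Note one difference from the boolean-mask approach: when none of the rows in a duplicate group has a listed class, this variation keeps the first row of the group, whereas the mask drops the whole group.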