Pandas: Retain a particular row based on weightage given on a particular column, if duplicates are present based on other columns after groupby
I have a dataframe df:
import pandas as pd
df = pd.DataFrame([["A","X",98,56,61], ["B","E",79,54,36], ["A","Y",98,56,61],["B","F",79,54,36], ["A","Z",98,56,61], ["A","W",48,51,85],["B","G",44,57,86],["B","H",79,54,36]], columns=["id","class","c1","c2","c3"])
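When we group by id, if there are duplicate rows based on multiple columns, e.g. c1, c2, c3, keep the row that carries the weightage given on column class.
For example, when we group by id A, the rows with class X, Y, Z are duplicates on c1, c2, c3. Among X, Y, Z the weightage is given to X, so keep X and drop the other rows. Similarly, among E, F, H the weightage is given to F, so keep F and drop the other rows.
Expected output: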
output = pd.DataFrame([["A","X",98,56,61],["B","F",79,54,36],["A","W",48,51,85],["B","G",44,57,86]], columns=["id","class","c1","c2","c3"])
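How can I do this?
From what I understand of your explanation, you can create a list of the classes that carry the weightage, then create 2 conditions (the row is a duplicate and has a weighted class, or the row is not a duplicate), and then do: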
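# Classes that carry the weightage in case of duplicates
cls = ['X', 'F']
# Mark every row whose (id, c1, c2, c3) combination occurs more than once
c = df.duplicated(['id', 'c1', 'c2', 'c3'], keep=False)
# Keep duplicated rows only if their class is weighted; keep all non-duplicated rows
out = df[(c & df['class'].isin(cls)) | ~c]
print(out)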
  id class  c1  c2  c3
0  A     X  98  56  61
3  B     F  79  54  36
5  A     W  48  51  85
6  B     G  44  57  86
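Note that the mask above drops every row of a duplicate group that contains none of the listed classes, and keeps more than one row if a group contains several of them. If that matters for your data, a possible variant is to turn the weightage into numbers and keep the highest-weighted row per group. The sketch below is only one way to do it: the `weights` dictionary, its values, and the default weight of 0 for unlisted classes are assumptions made for illustration, not part of the question.

```python
import pandas as pd

df = pd.DataFrame(
    [["A", "X", 98, 56, 61], ["B", "E", 79, 54, 36], ["A", "Y", 98, 56, 61],
     ["B", "F", 79, 54, 36], ["A", "Z", 98, 56, 61], ["A", "W", 48, 51, 85],
     ["B", "G", 44, 57, 86], ["B", "H", 79, 54, 36]],
    columns=["id", "class", "c1", "c2", "c3"],
)

# Assumed weights (hypothetical): a higher number wins inside a duplicate group.
weights = {"X": 1, "F": 1}

# Columns that define a duplicate group.
key_cols = ["id", "c1", "c2", "c3"]

# Map each class to its weight; classes missing from the dict get weight 0.
w = df["class"].map(weights).fillna(0)

# For each duplicate group, take the index label of the highest-weighted row.
# idxmax returns the first occurrence when several rows tie.
keep_idx = df.assign(_w=w).groupby(key_cols)["_w"].idxmax()

# Select those rows, restoring the original row order.
result = df.loc[sorted(keep_idx)]
print(result)
```

For the sample data this selects the same rows as the mask-based answer (index 0, 3, 5, 6), but it always keeps exactly one row per group of duplicates.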