Keep the row with the highest value among duplicates on different columns

I have a pandas DataFrame like this one, in which several rows can share the same combination of longitude and latitude:

Initial df:

   lon  lat         name  value protection      a      b         c  score
0   20   10       canada    563        NaN    cat    dog  elephant   20.0
1   30   10       canada     65        NaN   lion  tiger       cat   30.0
2   40   20       canada    893        NaN    dog    NaN       NaN   20.0
3   40   20          usa      4        NaN  horse  horse      lion   40.0
4   45   15          usa   8593        NaN    NaN   lion       cat   10.0
5   20   10  protection1    100     medium    NaN    NaN       NaN    NaN
6   40   20  protection1     20       high    NaN    NaN       NaN    NaN
7   50   30  protection1    500        low    NaN    NaN       NaN    NaN
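For anyone who wants to reproduce the problem, the frame shown above can be rebuilt roughly as follows. This is only a sketch that copies the printed values; the dtypes (integer `lon`, `lat`, `value`, float `score`) are assumed from how the table is displayed.

```python
import numpy as np
import pandas as pd

# Values copied from the printed table; NaN marks the missing cells.
df = pd.DataFrame({
    "lon": [20, 30, 40, 40, 45, 20, 40, 50],
    "lat": [10, 10, 20, 20, 15, 10, 20, 30],
    "name": ["canada", "canada", "canada", "usa", "usa",
             "protection1", "protection1", "protection1"],
    "value": [563, 65, 893, 4, 8593, 100, 20, 500],
    "protection": [np.nan, np.nan, np.nan, np.nan, np.nan,
                   "medium", "high", "low"],
    "a": ["cat", "lion", "dog", "horse", np.nan, np.nan, np.nan, np.nan],
    "b": ["dog", "tiger", np.nan, "horse", "lion", np.nan, np.nan, np.nan],
    "c": ["elephant", "cat", np.nan, "lion", "cat", np.nan, np.nan, np.nan],
    "score": [20.0, 30.0, 20.0, 40.0, 10.0, np.nan, np.nan, np.nan],
})

print(df)
```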

The output I want is:

   lon  lat protection      a      b         c  score
0   20   10     medium    cat    dog  elephant   20.0
1   30   10        NaN   lion  tiger       cat   30.0
2   40   20       high  horse  horse      lion   40.0
3   45   15        NaN    NaN   lion       cat   10.0
4   50   30        low    NaN    NaN       NaN    NaN

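The output DataFrame should contain one row per unique combination of lon and lat, keeping only the row with the highest score. However, if the rows sharing the same lon and lat include one with a value in the protection column, that value should be merged into the kept row.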

Try:

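import numpy as np

# Sort so the highest score comes first within each (lon, lat) group;
# rows with a NaN score end up last.
df = df.sort_values(by="score", ascending=False)
g = df.groupby(["lon", "lat"])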
df_out = (
    g.first()
    .assign(
        protection=g.agg(
            {"protection": lambda x: ",".join(x.dropna())}
        ).replace("", np.nan)
    )
    .reset_index()
)

print(df_out)

Prints:

   lon  lat         name  value protection      a      b         c  score
0   20   10       canada    563     medium    cat    dog  elephant   20.0
1   30   10       canada     65        NaN   lion  tiger       cat   30.0
2   40   20          usa      4       high  horse  horse      lion   40.0
3   45   15          usa   8593        NaN    NaN   lion       cat   10.0
4   50   30  protection1    500        low    NaN    NaN       NaN    NaN
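One detail worth knowing about this approach: `GroupBy.first()` returns the first non-null entry of each column separately, not the first row as a whole. With the sample data every kept value happens to come from the top-scoring row, but in general a result row can combine values from several rows. A small sketch on a made-up frame (`toy` and `key` are hypothetical names, not from the question):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, already sorted by score descending within each key.
toy = pd.DataFrame({
    "key": ["x", "x", "y"],
    "a": [np.nan, "cat", "dog"],
    "score": [40.0, 20.0, np.nan],
})

firsts = toy.groupby("key").first()

# For key "x" the top-scoring row has no value in "a", so first() takes
# "cat" from the second row while still taking score 40.0 from the first.
print(firsts)
```

If you need all values to come strictly from the single top-scoring row, take that row explicitly (for example with `head(1)` per group) and fill in only protection from the rest of the group.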

One way is to wrap your logic in a function and apply it to each group:

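def func(df1):
    # Fill gaps within the group in both directions so that every row
    # carries the non-null values found anywhere in the group.
    # (fillna(method=...) is deprecated in recent pandas; use bfill/ffill.)
    df1 = df1.bfill().ffill()
    # Keep only the row with the highest score.
    return df1.sort_values('score', ascending=False).iloc[:1]

result = df.groupby(['lon', 'lat']).apply(func)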

Then reset the index and select the columns to get exactly the posted output:

result.reset_index(drop=True)[['lon', 'lat', 'protection', 'a', 'b', 'c', 'score']]

   lon  lat protection      a      b         c  score
0   20   10     medium    cat    dog  elephant   20.0
1   30   10        NaN   lion  tiger       cat   30.0
2   40   20       high  horse  horse      lion   40.0
3   45   15        NaN    NaN   lion       cat   10.0
4   50   30        low    NaN    NaN       NaN    NaN
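Newer pandas versions change how `groupby(...).apply` treats the grouping columns, so the call above may warn or behave differently depending on your version. Below is a self-contained sketch of the same fill-then-pick idea that avoids `apply` by iterating over the groups. It swaps in `pd.concat` over a list comprehension, which is my substitution rather than part of the answer; `best_row` and `cols` are illustrative names.

```python
import io

import pandas as pd

# Sample data from the question, parsed from whitespace-separated text.
text = """\
lon lat name value protection a b c score
20 10 canada 563 NaN cat dog elephant 20.0
30 10 canada 65 NaN lion tiger cat 30.0
40 20 canada 893 NaN dog NaN NaN 20.0
40 20 usa 4 NaN horse horse lion 40.0
45 15 usa 8593 NaN NaN lion cat 10.0
20 10 protection1 100 medium NaN NaN NaN NaN
40 20 protection1 20 high NaN NaN NaN NaN
50 30 protection1 500 low NaN NaN NaN NaN
"""
df = pd.read_csv(io.StringIO(text), sep=r"\s+")


def best_row(group):
    # Fill gaps in both directions within the group, then keep the
    # single row with the highest score (NaN scores sort last).
    filled = group.bfill().ffill()
    return filled.sort_values("score", ascending=False).iloc[:1]


cols = ["lon", "lat", "protection", "a", "b", "c", "score"]

# Iterating over the groupby yields full sub-frames, including lon and lat.
result = pd.concat([best_row(g) for _, g in df.groupby(["lon", "lat"])])
result = result.reset_index(drop=True)[cols]

print(result)
```

Note that the bidirectional fill also copies a, b, c and score between rows of the same group, so this only matches the wanted output when that is acceptable for your data.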