Keep row with highest value amongst duplicates on different columns

Question

I have a pandas DataFrame like the one below, in which several rows can share the same combination of longitude and latitude:

Initial df:

   lon  lat         name  value protection      a      b         c  score
0   20   10       canada    563        NaN    cat    dog  elephant   20.0
1   30   10       canada     65        NaN   lion  tiger       cat   30.0
2   40   20       canada    893        NaN    dog    NaN       NaN   20.0
3   40   20          usa      4        NaN  horse  horse      lion   40.0
4   45   15          usa   8593        NaN    NaN   lion       cat   10.0
5   20   10  protection1    100     medium    NaN    NaN       NaN    NaN
6   40   20  protection1     20       high    NaN    NaN       NaN    NaN
7   50   30  protection1    500        low    NaN    NaN       NaN    NaN

But what I want is:

Desired output:

   lon  lat protection      a      b         c  score
0   20   10     medium    cat    dog  elephant   20.0
1   30   10        NaN   lion  tiger       cat   30.0
2   40   20       high  horse  horse      lion   40.0
3   45   15        NaN    NaN   lion       cat   10.0
4   50   30        low    NaN    NaN       NaN    NaN

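The output DataFrame should contain one row for each unique combination of the lon and lat columns, keeping only the row with the highest score. However, if a lon/lat pair has duplicate rows and one of them holds a value in the protection column, that value should be merged into the row that is kept.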
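To make the problem reproducible, the initial frame can be rebuilt roughly as follows. This is a sketch typed from the table above; the `duplicated` check at the end is an added illustration that lists the rows sharing a lon/lat pair, i.e. the rows that need to be collapsed.

```python
import numpy as np
import pandas as pd

# Rebuild the frame shown in the question; NaN marks the missing cells.
df = pd.DataFrame(
    {
        "lon": [20, 30, 40, 40, 45, 20, 40, 50],
        "lat": [10, 10, 20, 20, 15, 10, 20, 30],
        "name": ["canada", "canada", "canada", "usa", "usa",
                 "protection1", "protection1", "protection1"],
        "value": [563, 65, 893, 4, 8593, 100, 20, 500],
        "protection": [np.nan] * 5 + ["medium", "high", "low"],
        "a": ["cat", "lion", "dog", "horse", np.nan, np.nan, np.nan, np.nan],
        "b": ["dog", "tiger", np.nan, "horse", "lion", np.nan, np.nan, np.nan],
        "c": ["elephant", "cat", np.nan, "lion", "cat", np.nan, np.nan, np.nan],
        "score": [20.0, 30.0, 20.0, 40.0, 10.0, np.nan, np.nan, np.nan],
    }
)

# Rows whose (lon, lat) pair occurs more than once are the ones to collapse.
dupes = df[df.duplicated(["lon", "lat"], keep=False)]
print(dupes[["lon", "lat", "name", "score"]])
```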

Answer 1

Try:

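import numpy as np

# Highest score first, so the first row of each group is the top-scoring one.
df = df.sort_values(by="score", ascending=False)
g = df.groupby(["lon", "lat"])
df_out = (
    g.first()  # first non-null value of every column within each group
    .assign(
        # Join the non-null protection values of a group; "" means there were none.
        protection=g["protection"]
        .agg(lambda x: ",".join(x.dropna()))
        .replace("", np.nan)
    )
    .reset_index()
)

print(df_out)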

Prints:

   lon  lat         name  value protection      a      b         c  score
0   20   10       canada    563     medium    cat    dog  elephant   20.0
1   30   10       canada     65        NaN   lion  tiger       cat   30.0
2   40   20          usa      4       high  horse  horse      lion   40.0
3   45   15          usa   8593        NaN    NaN   lion       cat   10.0
4   50   30  protection1    500        low    NaN    NaN       NaN    NaN
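One detail worth keeping in mind with this approach: GroupBy.first() returns the first non-null entry of each column separately, so a result row can combine values from several source rows. That is what lets protection be picked up from the protection1 rows, but it also means a gap in the top-scoring row is filled from a lower-scoring one. A minimal sketch on a made-up two-row frame (the toy data is hypothetical, not taken from the question):

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): the top-scoring row has no value in "b".
toy = pd.DataFrame(
    {
        "lon": [40, 40],
        "lat": [20, 20],
        "a": ["horse", "dog"],
        "b": [np.nan, "tiger"],
        "score": [40.0, 20.0],
    }
)

top = (
    toy.sort_values("score", ascending=False)
    .groupby(["lon", "lat"])
    .first()
    .reset_index()
)

# "a" and "score" come from the score-40 row, but "b" is taken from the
# score-20 row because first() skips missing values column by column.
print(top)
```

With the data in the question this makes no difference, since the top-scoring rows have no gaps that a lower-scoring row could fill.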

Answer 2

One approach is to put your logic into a function and apply it to each group:

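def func(df1):
    # Fill missing values within the group in both directions, so the
    # group's protection value reaches every row.
    df1 = df1.bfill()
    df1 = df1.ffill()
    # Keep only the top-scoring row of the group.
    return df1.sort_values('score', ascending=False)[:1]

result = df.groupby(['lon', 'lat']).apply(func)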

Add an index reset and a column selection to get exactly the posted output:

result.reset_index(drop=True)[['lon', 'lat', 'protection', 'a', 'b', 'c', 'score']]

   lon  lat protection      a      b         c  score
0   20   10     medium    cat    dog  elephant   20.0
1   30   10        NaN   lion  tiger       cat   30.0
2   40   20       high  horse  horse      lion   40.0
3   45   15        NaN    NaN   lion       cat   10.0
4   50   30        low    NaN    NaN       NaN    NaN
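For comparison, here is a sketch of a different technique that avoids groupby.apply: spread each group's protection value with transform("first"), then sort by score and keep one row per coordinate pair with drop_duplicates. It is not taken from either answer, and it runs on a reduced copy of the question's data (fewer rows and columns) to stay short. Unlike the fill-based function, it only merges protection; every other column comes from the top-scoring row as it is.

```python
import numpy as np
import pandas as pd

# Reduced copy of the question's data (fewer rows and columns, for brevity).
df = pd.DataFrame(
    {
        "lon": [20, 40, 40, 20, 40, 50],
        "lat": [10, 20, 20, 10, 20, 30],
        "protection": [np.nan, np.nan, np.nan, "medium", "high", "low"],
        "a": ["cat", "dog", "horse", np.nan, np.nan, np.nan],
        "score": [20.0, 20.0, 40.0, np.nan, np.nan, np.nan],
    }
)

keys = ["lon", "lat"]

# Spread each group's first non-null protection onto all rows of that group.
filled = df.assign(protection=df.groupby(keys)["protection"].transform("first"))

# Sort by score (NaN scores go last), then keep one row per coordinate pair.
out = (
    filled.sort_values("score", ascending=False)
    .drop_duplicates(keys)
    .sort_values(keys)
    .reset_index(drop=True)
)
print(out)
```

If a coordinate pair could carry several different protection values, transform("first") would keep only one of them, so this sketch assumes at most one per pair.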
