Keep the row with the highest value among duplicates on different columns
I have a pandas DataFrame like this, in which several rows can share the same combination of longitude and latitude:
Initial df:
lon lat name value protection a b c score
0 20 10 canada 563 NaN cat dog elephant 20.0
1 30 10 canada 65 NaN lion tiger cat 30.0
2 40 20 canada 893 NaN dog NaN NaN 20.0
3 40 20 usa 4 NaN horse horse lion 40.0
4 45 15 usa 8593 NaN NaN lion cat 10.0
5 20 10 protection1 100 medium NaN NaN NaN NaN
6 40 20 protection1 20 high NaN NaN NaN NaN
7 50 30 protection1 500 low NaN NaN NaN NaN
But what I want is:
Desired output:
lon lat protection a b c score
0 20 10 medium cat dog elephant 20.0
1 30 10 NaN lion tiger cat 30.0
2 40 20 high horse horse lion 40.0
3 45 15 NaN NaN lion cat 10.0
4 50 30 low NaN NaN NaN NaN
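The output DataFrame should contain one row per unique combination of the lon and lat columns, keeping only the row with the highest score. However, if a lon and lat combination has duplicates and one of them has a value in the protection column, these should be merged into a single row.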
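Try:
import numpy as np

# Sort so the highest score comes first within each group (NaN scores sort last).
df = df.sort_values(by="score", ascending=False)
g = df.groupby(["lon", "lat"])
df_out = (
    g.first()  # first non-null value per column in each group
    .assign(
        # Join all non-null protection values of a group; empty string -> NaN.
        protection=g["protection"]
        .agg(lambda x: ",".join(x.dropna()))
        .replace("", np.nan)
    )
    .reset_index()
)
print(df_out)
Prints: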
lon lat name value protection a b c score
0 20 10 canada 563 medium cat dog elephant 20.0
1 30 10 canada 65 NaN lion tiger cat 30.0
2 40 20 usa 4 high horse horse lion 40.0
3 45 15 usa 8593 NaN NaN lion cat 10.0
4 50 30 protection1 500 low NaN NaN NaN NaN
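To make the snippet above reproducible, here is a minimal sketch that rebuilds the sample frame from the question and trims the result to the columns in the desired output. It assumes each lon/lat group carries at most one protection value, so the join step is skipped and `GroupBy.first()` (first non-null value per column) does all the merging. The names `wanted` and `out` are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "lon": [20, 30, 40, 40, 45, 20, 40, 50],
    "lat": [10, 10, 20, 20, 15, 10, 20, 30],
    "name": ["canada", "canada", "canada", "usa", "usa",
             "protection1", "protection1", "protection1"],
    "value": [563, 65, 893, 4, 8593, 100, 20, 500],
    "protection": [np.nan] * 5 + ["medium", "high", "low"],
    "a": ["cat", "lion", "dog", "horse", np.nan, np.nan, np.nan, np.nan],
    "b": ["dog", "tiger", np.nan, "horse", "lion", np.nan, np.nan, np.nan],
    "c": ["elephant", "cat", np.nan, "lion", "cat", np.nan, np.nan, np.nan],
    "score": [20.0, 30.0, 20.0, 40.0, 10.0, np.nan, np.nan, np.nan],
})

# Columns of the desired output (name and value are dropped).
wanted = ["lon", "lat", "protection", "a", "b", "c", "score"]

out = (
    df.sort_values("score", ascending=False)  # NaN scores sort last by default
    .groupby(["lon", "lat"], as_index=False)
    .first()[wanted]                          # first non-null value per column
)
print(out)
```

If a group can hold several protection values, keep the `",".join(...)` aggregation from the answer instead of relying on `first()` alone.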
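One approach is to wrap your logic in a function and apply it to each group:
def func(df1):
    # Fill gaps within the group from the rows below, then from the rows above.
    # (fillna(method=...) is deprecated in recent pandas; use bfill()/ffill().)
    df1 = df1.bfill().ffill()
    # Keep only the row with the highest score.
    return df1.sort_values('score', ascending=False).head(1)

result = df.groupby(['lon', 'lat']).apply(func)
Then reset the index and select the columns to get exactly the output posted in the question: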
result.reset_index(drop=True)[['lon', 'lat', 'protection', 'a', 'b', 'c', 'score']]
| | lon | lat | protection | a | b | c | score |
|---|---|---|---|---|---|---|---|
| 0 | 20 | 10 | medium | cat | dog | elephant | 20 |
| 1 | 30 | 10 | NaN | lion | tiger | cat | 30 |
| 2 | 40 | 20 | high | horse | horse | lion | 40 |
| 3 | 45 | 15 | NaN | NaN | lion | cat | 10 |
| 4 | 50 | 30 | low | NaN | NaN | NaN | NaN |
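As a possible variant of the fill-then-pick idea, the sketch below avoids `groupby.apply` (recent pandas versions warn when `apply` operates on the grouping columns) by using the group-wise `bfill()`/`ffill()` methods and then keeping the top-scoring row per group with `drop_duplicates`. This is an alternative, not the answerer's code; the names `keys`, `cols`, `filled` and `result` are hypothetical, and the sample frame is copied from the question without the unused name and value columns.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question (name/value omitted, they are not needed).
df = pd.DataFrame({
    "lon": [20, 30, 40, 40, 45, 20, 40, 50],
    "lat": [10, 10, 20, 20, 15, 10, 20, 30],
    "protection": [np.nan] * 5 + ["medium", "high", "low"],
    "a": ["cat", "lion", "dog", "horse", np.nan, np.nan, np.nan, np.nan],
    "b": ["dog", "tiger", np.nan, "horse", "lion", np.nan, np.nan, np.nan],
    "c": ["elephant", "cat", np.nan, "lion", "cat", np.nan, np.nan, np.nan],
    "score": [20.0, 30.0, 20.0, 40.0, 10.0, np.nan, np.nan, np.nan],
})

keys = ["lon", "lat"]
cols = ["protection", "a", "b", "c", "score"]

# Back-fill, then forward-fill, missing values inside each lon/lat group.
filled = df[keys].join(df.groupby(keys)[cols].bfill())
filled[cols] = filled.groupby(keys)[cols].ffill()

result = (
    filled.sort_values("score", ascending=False, kind="stable")  # NaN last
    .drop_duplicates(keys)       # keep the highest-scoring row per group
    .sort_values(keys)
    .reset_index(drop=True)
)
print(result)
```

Note that filling happens before the highest score is picked, so a top-scoring row can inherit values from lower-scoring rows in the same group, as it does in the `apply` version.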