Keep first occurrence row by id and first occurrence when value in column changes
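In the sample df below, what is the best way to keep:
- the first row in which each `id` appears with a `Score`
- then, for each `id`, the first row whenever its value in `Score` changes, dropping the repeated rows until the next change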
Sample df
date id Score
0 2001-09-06 1 3
1 2001-09-07 1 3
2 2001-09-08 1 4
3 2001-09-09 2 6
4 2001-09-10 2 6
5 2001-09-11 1 4
6 2001-09-12 2 5
7 2001-09-13 2 5
8 2001-09-14 1 3
Desired df
date id Score
0 2001-09-06 1 3
1 2001-09-08 1 4
2 2001-09-09 2 6
3 2001-09-12 2 5
4 2001-09-14 1 3
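Grouping on both columns is not enough, because it returns a single row per (id, Score) pair and so loses the second run of id 1 with Score 3 on 2001-09-14:
df.groupby(['id', 'Score']).first()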
Following your logic:
# shift Score within id
# shifted score at each group start is `NaN`
shifted_scores = df['Score'].groupby(df['id']).shift()
# change of Score within each id
# since first shifted score in each group is `NaN`
# mask is also True at first line of each group
mask = df['Score'].ne(shifted_scores)
# output
df[mask]
Output:
date id Score
0 2001-09-06 1 3
2 2001-09-08 1 4
3 2001-09-09 2 6
6 2001-09-12 2 5
8 2001-09-14 1 3
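For anyone who wants to run this end to end, here is a minimal self-contained sketch. It rebuilds the sample frame from the question by hand and wraps the same shift-and-compare mask in a helper. The construction code, the assumed dtypes, and the name `keep_changes` are mine, not part of the original post.

```python
import pandas as pd

# Sample frame from the question, rebuilt by hand
# (assumption: date is a datetime column, id and Score are integers)
df = pd.DataFrame({
    "date": pd.date_range("2001-09-06", periods=9, freq="D"),
    "id": [1, 1, 1, 2, 2, 1, 2, 2, 1],
    "Score": [3, 3, 4, 6, 6, 4, 5, 5, 3],
})


def keep_changes(frame, key="id", value="Score"):
    """Hypothetical helper: keep the first row per key and every row
    where `value` differs from the previous row of the same key."""
    # Previous value within each key; NaN on the first row of every key
    previous = frame.groupby(key)[value].shift()
    # ne() treats a comparison with NaN as "not equal", so first rows are kept
    return frame[frame[value].ne(previous)]


result = keep_changes(df)
print(result)
```

The result keeps the original index labels (0, 2, 3, 6, 8). Appending `.reset_index(drop=True)` should give the 0 to 4 numbering shown in the desired df.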
Use `groupby` with `diff`:
print(df[df.groupby("id")["Score"].diff() != 0])
date id Score
0 2001-09-06 1 3
2 2001-09-08 1 4
3 2001-09-09 2 6
6 2001-09-12 2 5
8 2001-09-14 1 3
The first occurrence in each group always has a diff of NaN, and NaN != 0 is True, so that row is kept.
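A small sketch of why the `!= 0` comparison keeps the first row of every group, using the sample values rebuilt by hand (the construction code is mine; only the `diff` expression comes from the answer). One caveat: `diff` subtracts, so this variant assumes `Score` is numeric, whereas the `shift` and `ne` version in the other answer also works for non-numeric values such as strings.

```python
import pandas as pd

# id and Score columns from the sample df (date omitted, it plays no role here)
df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 1, 2, 2, 1],
    "Score": [3, 3, 4, 6, 6, 4, 5, 5, 3],
})

# Difference to the previous row of the same id;
# the first row of each id has no predecessor, so its diff is NaN
delta = df.groupby("id")["Score"].diff()
print(delta.tolist())

# NaN != 0 evaluates to True, so first rows pass the filter
# together with the rows where Score really changed
mask = delta != 0
print(df[mask].index.tolist())  # [0, 2, 3, 6, 8]
```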