Finding unique records in pandas using a combination of multiple columns
Suppose I have this pandas DataFrame:
df
street_id district_id region_id value1 value2
1 6 8 7 5
1 5 8 9 3
2 6 5 8 0
2 6 5 6 2
3 4 8 5 1
3 7 9 0 2
The expected output is:
street_id district_id region_id
2 6 5
3 4 8
3 7 9
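I only want to select the records for streets that are unique within a region. I can't simply take the unique combinations of street_id and region_id, because I also need district_id. How can I do this?
Here, a street counts as unique if it exists in only one district of a region.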
IIUC:
In [15]: df.assign(x=df.groupby(['region_id','street_id'])['district_id']
    ...:                 .transform('nunique')) \
    ...:   .query("x == 1") \
    ...:   .drop_duplicates(subset=['street_id','region_id']) \
    ...:   .drop(columns='x')
Out[15]:
street_id district_id region_id value1 value2
2 2 6 5 8 0
4 3 4 8 5 1
5 3 7 9 0 2
Or, as a nicer and shorter version:
df[df.groupby(['region_id','street_id'])['district_id']
.transform('nunique').eq(1)] \
.drop_duplicates(subset=['street_id','region_id'])
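For anyone who wants to run the short version end to end, here is a minimal, self-contained sketch. The DataFrame is rebuilt from the table in the question; the variable names `n_districts` and `result` are only illustrative and not part of the original answer.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "street_id":   [1, 1, 2, 2, 3, 3],
    "district_id": [6, 5, 6, 6, 4, 7],
    "region_id":   [8, 8, 5, 5, 8, 9],
    "value1":      [7, 9, 8, 6, 5, 0],
    "value2":      [5, 3, 0, 2, 1, 2],
})

# Number of distinct districts per (region, street) pair, aligned to df's rows.
n_districts = df.groupby(["region_id", "street_id"])["district_id"].transform("nunique")

# Keep rows whose street lies in exactly one district of its region,
# then keep the first row for each (street, region) pair.
result = df[n_districts.eq(1)].drop_duplicates(subset=["street_id", "region_id"])
print(result)
```

Building the boolean mask directly avoids the temporary `x` column, so there is nothing to drop afterwards.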
Step by step:
In [16]: df.groupby(['region_id','street_id'])['district_id'].transform('nunique')
Out[16]:
0 2
1 2
2 1
3 1
4 1
5 1
Name: district_id, dtype: int64
In [17]: df.assign(x=df.groupby(['region_id','street_id'])['district_id'].transform('nunique'))
Out[17]:
street_id district_id region_id value1 value2 x
0 1 6 8 7 5 2
1 1 5 8 9 3 2
2 2 6 5 8 0 1
3 2 6 5 6 2 1
4 3 4 8 5 1 1
5 3 7 9 0 2 1
In [18]: df.assign(x=df.groupby(['region_id','street_id'])['district_id'].transform('nunique')) \
    ...:   .query("x == 1")
Out[18]:
street_id district_id region_id value1 value2 x
2 2 6 5 8 0 1
3 2 6 5 6 2 1
4 3 4 8 5 1 1
5 3 7 9 0 2 1
In [19]: df.assign(x=df.groupby(['region_id','street_id'])['district_id'].transform('nunique')) \
    ...:   .query("x == 1") \
    ...:   .drop_duplicates(subset=['street_id','region_id'])
Out[19]:
street_id district_id region_id value1 value2 x
2 2 6 5 8 0 1
4 3 4 8 5 1 1
5 3 7 9 0 2 1
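The outputs above still carry value1 and value2, while the expected output in the question lists only the three id columns. A possible way to get exactly that shape is sketched below; the helper name `streets_in_one_district` is hypothetical, and resetting the index is an assumption about what the asker wants.

```python
import pandas as pd

def streets_in_one_district(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: one row per street that lies in a single district of its region."""
    keys = ["street_id", "district_id", "region_id"]
    # Distinct districts per (region, street) pair, broadcast back to every row.
    n_districts = frame.groupby(["region_id", "street_id"])["district_id"].transform("nunique")
    return (
        frame.loc[n_districts.eq(1), keys]  # keep qualifying rows, id columns only
        .drop_duplicates()                  # collapse repeated (street, district, region) rows
        .reset_index(drop=True)
    )

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "street_id":   [1, 1, 2, 2, 3, 3],
    "district_id": [6, 5, 6, 6, 4, 7],
    "region_id":   [8, 8, 5, 5, 8, 9],
    "value1":      [7, 9, 8, 6, 5, 0],
    "value2":      [5, 3, 0, 2, 1, 2],
})

expected = streets_in_one_district(df)
print(expected)
```

After the filter, every remaining (region, street) pair has a single district, so dropping duplicates over the three id columns is equivalent to the `subset=['street_id','region_id']` form used in the answer.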