Tagging ALL duplicates - Pandas DataFrame - even the first instance, without NaNs in output
I have this dataframe (sample):
import pandas as pd

employees = [('Mohd', 28, 'NY'),
             ('Anne', 32, 'London'),
             ('Aaditya', 25, 'Mumbai'),
             ('Anne', 32, 'London'),
             ('Anne', 32, 'London'),
             ('Anne', 32, 'Mumbai'),
             ('Aaditya', 40, 'Dubai'),
             ('Link', 32, 'London')]
emp = pd.DataFrame(employees, columns=['Name', 'Age', 'City'])
I want to find the duplicated names and store the result in the dataframe itself. If I do emp["duplname"] = emp.Name.duplicated(), I get
      Name  Age    City  duplname
0     Mohd   28      NY     False
1     Anne   32  London   *False*
2  Aaditya   25  Mumbai   *False*
3     Anne   32  London      True
4     Anne   32  London      True
5     Anne   32  Mumbai      True
6  Aaditya   40   Dubai      True
7     Link   32  London     False
However, I want the duplname values marked with asterisks to be True as well, because technically those rows are duplicates too. So I did this instead:
g = emp.groupby(['Name'])
df1 = emp.set_index(['Name'])
emp['dup_index'] = df1.index.map(lambda ind: g.indices[ind][0])
emp['counts'] = emp['dup_index'].value_counts()
But this gives me an output with NaNs:
      Name  Age    City  duplname  dup_index  counts
0     Mohd   28      NY     False          0     1.0
1     Anne   32  London     False          1     4.0
2  Aaditya   25  Mumbai     False          2     2.0
3     Anne   32  London      True          1     NaN
4     Anne   32  London      True          1     NaN
5     Anne   32  Mumbai      True          1     NaN
6  Aaditya   40   Dubai      True          2     NaN
7     Link   32  London     False          7     1.0
NaN is not descriptive, and sometimes names are missing, so the NaNs are misleading. Is there a way to tag all duplicates?
Pass keep=False to Series.duplicated to mark all duplicates, including the first occurrence:
emp["duplname"] = emp.Name.duplicated(keep=False)
The resulting emp:
      Name  Age    City  duplname
0     Mohd   28      NY     False
1     Anne   32  London      True
2  Aaditya   25  Mumbai      True
3     Anne   32  London      True
4     Anne   32  London      True
5     Anne   32  Mumbai      True
6  Aaditya   40   Dubai      True
7     Link   32  London     False
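
As for the NaNs in the question's counts column: value_counts() returns a Series indexed by the distinct dup_index values (0, 1, 2 and 7 here), and column assignment aligns on index labels, so only rows 0, 1, 2 and 7 receive a value. If a per-row count is still wanted, one possible approach is to map each name to its count instead. The sketch below is not part of the original answer; the two rows with a missing name are hypothetical, added only to illustrate the "names are sometimes missing" case, and treating missing names as non-duplicates is an assumption about the desired behaviour.

```python
import numpy as np
import pandas as pd

employees = [('Mohd', 28, 'NY'),
             ('Anne', 32, 'London'),
             ('Aaditya', 25, 'Mumbai'),
             ('Anne', 32, 'London'),
             ('Anne', 32, 'London'),
             ('Anne', 32, 'Mumbai'),
             ('Aaditya', 40, 'Dubai'),
             ('Link', 32, 'London'),
             (np.nan, 30, 'Paris'),  # hypothetical row with a missing name
             (np.nan, 41, 'Rome')]   # hypothetical row with a missing name
emp = pd.DataFrame(employees, columns=['Name', 'Age', 'City'])

# Flag every member of a duplicate group; the notna() mask keeps rows
# with a missing name from being flagged (an assumption about intent).
emp['duplname'] = emp['Name'].duplicated(keep=False) & emp['Name'].notna()

# value_counts() skips missing values by default, so mapping a missing
# name yields NaN, which is replaced by 0 before converting to int.
name_counts = emp['Name'].value_counts()
emp['counts'] = emp['Name'].map(name_counts).fillna(0).astype(int)

print(emp)
```

Because map looks each row's name up in name_counts, every row gets its own value and no index alignment is involved, which is why this avoids the NaNs seen above.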