How to count duplicates in a Pandas column?
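I use this rule to keep only the rows where the value in column num is unique, so the duplicates are removed:
df.drop_duplicates(subset=["num"], keep=False)
I do the same for column age:
df.drop_duplicates(subset=["age"], keep=False)
How can I show the result as another table containing statistics about the removed elements, like this:
Duplicates num (total) Duplicates age (total)
1 18
Thanks for the answers. One note, my data looks like this: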
NUM AGE
1 18
2 18
3 18
4 20
As a result I need to get:
NUM AGE
4 20
and extract the duplicates (the NUM column values) into a Python list for later insertion into a database:
duplicatesNums = [1,2,3]
IIUC, you can use duplicated inside apply, with sum to count the values:
df[['num', 'age']].apply(lambda c: c.duplicated(keep=False).sum())
Dummy example:
import pandas as pd
df = pd.DataFrame({'num': list('AABCDD'), 'age': list('112345')})
df[['num', 'age']].apply(lambda c: c.duplicated(keep=False).sum())
Output:
num 4
age 2
dtype: int64
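One thing worth spelling out is what keep=False counts. The short sketch below reuses the dummy data above (the names all_members and extras_only are just illustrative) and compares it with the default keep='first':

```python
import pandas as pd

df = pd.DataFrame({'num': list('AABCDD'), 'age': list('112345')})

# keep=False flags every member of a duplicate group (both A's and both D's)
all_members = int(df['num'].duplicated(keep=False).sum())

# the default keep='first' flags only the repeats after the first occurrence
extras_only = int(df['num'].duplicated().sum())

print(all_members)  # 4
print(extras_only)  # 2
```

Since the question drops rows with keep=False, counting with keep=False is the variant that matches the number of removed rows.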
For the follow-up question:
# identify duplicates
mask = df['AGE'].duplicated(keep=False)
# get indices
ids = mask[mask].index.to_list()
# [0, 1, 2]
# filter DataFrame
df2 = df[~mask]
# NUM AGE
# 3 4 20
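Note that ids above holds index labels, not NUM values. A self-contained sketch with the data from the question, assuming a default RangeIndex, shows one way to map those labels back to the NUM column (the name nums is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'NUM': [1, 2, 3, 4], 'AGE': [18, 18, 18, 20]})

# True for every row whose AGE occurs more than once
mask = df['AGE'].duplicated(keep=False)

# index labels of the flagged rows
ids = mask[mask].index.to_list()

# look up the NUM values for those labels
nums = df.loc[ids, 'NUM'].to_list()

# rows whose AGE is unique
df2 = df[~mask]

print(ids)   # [0, 1, 2]
print(nums)  # [1, 2, 3]
```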
For a new DataFrame, call Series.duplicated on each column via DataFrame.apply and count the True values with sum. To get a one-row DataFrame, convert the resulting Series with to_frame and transpose it; rename produces the column labels:
d = {'num': [1, 1, 1, 1, 3, 1, 2, 2],
     'age': [10, 10, 10, 11, 11, 98, 99, 102]}
df = pd.DataFrame(data=d)
print(df)
num age
0 1 10
1 1 10
2 1 10
3 1 11
4 3 11
5 1 98
6 2 99
7 2 102
f = lambda x: f'Duplicates {x} (total)'
# store the summary under a new name so the original df stays available
out = (df[['num','age']].apply(lambda x: x.duplicated(keep=False))
                        .sum()
                        .rename(f)
                        .to_frame()
                        .T)
print(out)
Duplicates num (total) Duplicates age (total)
0 7 5
Alternative:
# df here is the original num/age DataFrame
out = pd.DataFrame({f'Duplicates {x} (total)': [df[x].duplicated(keep=False).sum()]
                    for x in ['num','age']})
print(out)
Duplicates num (total) Duplicates age (total)
0 7 5
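As a sanity check, these totals should equal the number of rows that drop_duplicates(..., keep=False) removes for the same column. A minimal sketch on the sample data above (the results dict is only there for illustration):

```python
import pandas as pd

df = pd.DataFrame({'num': [1, 1, 1, 1, 3, 1, 2, 2],
                   'age': [10, 10, 10, 11, 11, 98, 99, 102]})

results = {}
for col in ['num', 'age']:
    # rows flagged as belonging to a duplicate group
    flagged = int(df[col].duplicated(keep=False).sum())
    # rows that disappear when those groups are dropped entirely
    removed = len(df) - len(df.drop_duplicates(subset=[col], keep=False))
    results[col] = (flagged, removed)

print(results)  # {'num': (7, 7), 'age': (5, 5)}
```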
EDIT: To get the non-duplicated rows, use:
df1 = df[~df['AGE'].duplicated(keep=False)]
print (df1)
NUM AGE
3 4 20
To get a list of the NUM values for the rows whose AGE is duplicated, use:
duplicatesNums = df.loc[df['AGE'].duplicated(keep=False), 'NUM'].tolist()
print (duplicatesNums)
[1, 2, 3]
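Since the question mentions inserting the list into a database, here is a rough sketch of that step. The target database is not specified in the question, so this assumes an in-memory SQLite database from the standard library, and the table name duplicate_nums is hypothetical:

```python
import sqlite3
import pandas as pd

df = pd.DataFrame({'NUM': [1, 2, 3, 4], 'AGE': [18, 18, 18, 20]})
duplicatesNums = df.loc[df['AGE'].duplicated(keep=False), 'NUM'].tolist()

# assumption: in-memory SQLite and a made-up table, for illustration only
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE duplicate_nums (num INTEGER)')

# tolist() yields plain Python ints, which sqlite3 can bind directly
conn.executemany('INSERT INTO duplicate_nums (num) VALUES (?)',
                 [(n,) for n in duplicatesNums])
conn.commit()

stored = [row[0] for row in
          conn.execute('SELECT num FROM duplicate_nums ORDER BY num')]
conn.close()

print(stored)  # [1, 2, 3]
```

Using tolist() rather than passing the NumPy values directly avoids type-binding problems with drivers that do not understand numpy.int64.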
If NUM is the index:
print (df)
AGE
NUM
1 18
2 18
3 18
4 20
duplicatesNums = df.index[df['AGE'].duplicated(keep=False)].tolist()
print (duplicatesNums)
[1, 2, 3]
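If a per-value breakdown is wanted rather than a single total, value_counts may help. This is a sketch on the data from the question, not part of the approach above, and repeated_counts is an illustrative name:

```python
import pandas as pd

df = pd.DataFrame({'NUM': [1, 2, 3, 4], 'AGE': [18, 18, 18, 20]})

# how often each AGE value occurs
counts = df['AGE'].value_counts()

# keep only the values that occur more than once
repeated = counts[counts > 1]

# convert to plain Python ints for a clean dict
repeated_counts = {int(k): int(v) for k, v in repeated.items()}
print(repeated_counts)  # {18: 3}
```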