How to count duplicates in a Pandas column?

I use this rule to keep only the rows whose value in column num is unique, so the duplicates are dropped:

df.drop_duplicates(subset=["num"], keep=False)

I do the same for column age:

df.drop_duplicates(subset=["age"], keep=False)

How can I show the result as another table that contains statistics on the removed elements, like this:

Duplicates num (total)  Duplicates age (total)
1                       18

Thanks for the answers. One note, my data is:

NUM AGE
1        18
2        18
3        18
4        20

As a result I need to get:

NUM AGE
4        20

and extract the duplicates (the NUM column values) into a Python list for later insertion into a database: duplicatesNums = [1,2,3]

IIUC, you can use duplicated inside apply, followed by sum, to count the duplicated values per column:

df[['num', 'age']].apply(lambda c: c.duplicated(keep=False).sum())

Dummy example:

import pandas as pd
df = pd.DataFrame({'num': list('AABCDD'), 'age': list('112345')})
df[['num', 'age']].apply(lambda c: c.duplicated(keep=False).sum())

Output:

num    4
age    2
dtype: int64

For the follow-up question:

# identify duplicates
mask = df['AGE'].duplicated(keep=False)
# get indices
ids = mask[mask].index.to_list()
# [0, 1, 2]

# filter DataFrame
df2 = df[~mask]
#     NUM  AGE
#  3    4   20
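
The mask approach above can be wrapped in a small helper. This is only a sketch: split_duplicates is a hypothetical name (not a pandas function), and the data is the four-row sample from the question.

```python
import pandas as pd


def split_duplicates(frame, column):
    """Hypothetical helper: split rows by whether `column` repeats."""
    # keep=False marks every row whose value occurs more than once
    mask = frame[column].duplicated(keep=False)
    return frame[~mask], frame[mask]


df = pd.DataFrame({'NUM': [1, 2, 3, 4], 'AGE': [18, 18, 18, 20]})
unique_rows, duplicate_rows = split_duplicates(df, 'AGE')

print(unique_rows)                     # the single row with NUM 4, AGE 20
print(duplicate_rows.index.to_list())  # [0, 1, 2]
```

Note that the list holds index labels, not NUM values; they only look similar here because the default index starts at 0.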

For a new DataFrame, call Series.duplicated per column inside DataFrame.apply and count the True values with sum. To get a one-row DataFrame, convert the resulting Series with to_frame and transpose it; rename relabels the columns:

import pandas as pd

d = {'num': [1, 1, 1, 1, 3, 1, 2, 2],
     'age': [10, 10, 10, 11, 11, 98, 99, 102]}
df = pd.DataFrame(data=d)
print(df)
   num  age
0    1   10
1    1   10
2    1   10
3    1   11
4    3   11
5    1   98
6    2   99
7    2  102

f = lambda x: f'Duplicates {x} (total)'
# store the result under a new name so the original df stays available
summary = (df[['num','age']].apply(lambda x: x.duplicated(keep=False))
                            .sum()
                            .rename(f)
                            .to_frame()
                            .T)

print(summary)
   Duplicates num (total)  Duplicates age (total)
0                       7                       5
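
The totals depend on the keep argument. As a rough illustration on the same sample data, keep=False counts every row that belongs to a repeated group, while the default keep='first' counts only the extra copies after the first occurrence. The variable names below are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'num': [1, 1, 1, 1, 3, 1, 2, 2],
                   'age': [10, 10, 10, 11, 11, 98, 99, 102]})
cols = ['num', 'age']

# keep=False flags every member of a repeated group
all_members = df[cols].apply(lambda s: s.duplicated(keep=False)).sum()
# keep='first' (the default) leaves the first occurrence unflagged
extra_copies = df[cols].apply(lambda s: s.duplicated(keep='first')).sum()

print(all_members.tolist())   # [7, 5]
print(extra_copies.tolist())  # [5, 3]
```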

Alternative:

# df is the original num/age DataFrame here, not the one-row summary
summary = pd.DataFrame({f'Duplicates {x} (total)': [df[x].duplicated(keep=False).sum()]
                        for x in ['num', 'age']})

print(summary)

   Duplicates num (total)  Duplicates age (total)
0                       7                       5
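
Since the question asks for statistics on the removed rows, it may help to check that these totals match what drop_duplicates(keep=False) actually removes. A minimal sketch on the same sample data (the results dict is just an illustrative container):

```python
import pandas as pd

df = pd.DataFrame({'num': [1, 1, 1, 1, 3, 1, 2, 2],
                   'age': [10, 10, 10, 11, 11, 98, 99, 102]})

results = {}
for col in ['num', 'age']:
    # rows flagged as belonging to a repeated group
    flagged = int(df[col].duplicated(keep=False).sum())
    # rows that disappear when those groups are dropped entirely
    removed = len(df) - len(df.drop_duplicates(subset=[col], keep=False))
    results[col] = (flagged, removed)
    print(col, flagged, removed)
# num 7 7
# age 5 5
```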

EDIT: To keep only the non-duplicated rows (df here is the NUM/AGE DataFrame from the question), use:

df1 = df[~df['AGE'].duplicated(keep=False)]
print (df1)
   NUM  AGE
3    4   20

To get the NUM values of the duplicated rows as a list, use:

duplicatesNums = df.loc[df['AGE'].duplicated(keep=False), 'NUM'].tolist()
print (duplicatesNums)
[1, 2, 3]

If NUM is the index:

print (df)
     AGE
NUM     
1     18
2     18
3     18
4     20

duplicatesNums = df.index[df['AGE'].duplicated(keep=False)].tolist()
print (duplicatesNums)
[1, 2, 3]
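
The question mentions inserting the list into a database afterwards. The following is only a sketch under stated assumptions: an in-memory SQLite database stands in for the real one, and the table name duplicate_nums is hypothetical.

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({'NUM': [1, 2, 3, 4], 'AGE': [18, 18, 18, 20]})
# tolist() returns plain Python ints, which sqlite3 accepts as parameters
duplicatesNums = df.loc[df['AGE'].duplicated(keep=False), 'NUM'].tolist()

# assumption: in-memory SQLite replaces the real target database
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE duplicate_nums (num INTEGER)')  # hypothetical table
conn.executemany('INSERT INTO duplicate_nums (num) VALUES (?)',
                 [(n,) for n in duplicatesNums])
conn.commit()

stored = [row[0] for row in
          conn.execute('SELECT num FROM duplicate_nums ORDER BY num')]
print(stored)  # [1, 2, 3]
conn.close()
```

Passing the values as parameters, rather than formatting them into the SQL string, leaves quoting to the driver.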