获取 Pandas 具有原始索引的重复行数
Get Pandas Duplicate Row Count with Original Index
我需要在 Pandas 数据框中找到重复的行,然后添加一个带有计数的额外列。假设我们有一个数据框:
>>print(df)
+----+-----+-----+-----+-----+-----+-----+-----+-----+
| | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|----+-----+-----+-----+-----+-----+-----+-----+-----|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2 | 4 | 3 | 4 | 1 | 1 | 4 | 4 |
| 3 | 4 | 3 | 4 | 0 | 0 | 0 | 0 | 0 |
| 4 | 2 | 3 | 4 | 3 | 4 | 0 | 0 | 0 |
| 5 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 4 | 5 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 1 | 1 | 4 | 0 | 0 | 0 | 0 | 0 |
| 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 4 | 3 | 4 | 0 | 0 | 0 | 0 | 0 |
| 10 | 3 | 3 | 4 | 3 | 5 | 5 | 5 | 0 |
| 11 | 5 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |
| 12 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 13 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |
| 14 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 15 | 1 | 3 | 5 | 0 | 0 | 0 | 0 | 0 |
| 16 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 17 | 3 | 3 | 4 | 4 | 0 | 0 | 0 | 0 |
| 18 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
+----+-----+-----+-----+-----+-----+-----+-----+-----+
上面的框架将变成下面的框架,并带有一个带有计数的附加列。你可以看到我们仍然保留了索引列。
+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|----+-----+-----+-----+-----+-----+-----+-----+-----|-----|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 2 | 2 | 4 | 3 | 4 | 1 | 1 | 4 | 4 | 1 |
| 3 | 4 | 3 | 4 | 0 | 0 | 0 | 0 | 0 | 2 |
| 4 | 2 | 3 | 4 | 3 | 4 | 0 | 0 | 0 | 1 |
| 5 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
| 6 | 4 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 7 | 1 | 1 | 4 | 0 | 0 | 0 | 0 | 0 | 1 |
| 10 | 3 | 3 | 4 | 3 | 5 | 5 | 5 | 0 | 1 |
| 11 | 5 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 13 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 15 | 1 | 3 | 5 | 0 | 0 | 0 | 0 | 0 | 1 |
| 16 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 17 | 3 | 3 | 4 | 4 | 0 | 0 | 0 | 0 | 1 |
+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
我见过其他解决方案,例如:
df.groupby(list(df.columns.values)).size()
但是 returns 一个有间隙且没有初始索引的矩阵。
您可以通过 first
和 len
:
使用 reset_index
first for convert index
to columns and then aggregate
此外,如果需要按所有列分组,则需要删除 index
列 difference
:
print (df.columns.difference(['index']))
Index(['2', '3', '4', '5', '6', '7', '8', '9'], dtype='object')
print (df.reset_index()
.groupby(df.columns.difference(['index']).tolist())['index']
.agg(['first', 'size'])
.reset_index()
.set_index(['first'])
.sort_index()
.rename_axis(None))
2 3 4 5 6 7 8 9 size
0 0 0 0 0 0 0 0 0 2
1 2 0 0 0 0 0 0 0 2
2 2 4 3 4 1 1 4 4 1
3 4 3 4 0 0 0 0 0 2
4 2 3 4 3 4 0 0 0 1
5 5 0 0 0 0 0 0 0 3
6 4 5 0 0 0 0 0 0 1
7 1 1 4 0 0 0 0 0 1
10 3 3 4 3 5 5 5 0 1
11 5 4 0 0 0 0 0 0 1
13 0 4 0 0 0 0 0 0 1
15 1 3 5 0 0 0 0 0 1
16 4 0 0 0 0 0 0 0 1
17 3 3 4 4 0 0 0 0 1
如有必要,添加下一列 10
需要 rename
:
#if necessary convert to str
last_col = str(df.columns.astype(int).max() + 1)
print (last_col)
10
print (df.reset_index()
.groupby(df.columns.difference(['index']).tolist())['index']
.agg(['first', 'size'])
.reset_index()
.set_index(['first'])
.sort_index()
.rename_axis(None)
.rename(columns={'size':last_col}))
2 3 4 5 6 7 8 9 10
0 0 0 0 0 0 0 0 0 2
1 2 0 0 0 0 0 0 0 2
2 2 4 3 4 1 1 4 4 1
3 4 3 4 0 0 0 0 0 2
4 2 3 4 3 4 0 0 0 1
5 5 0 0 0 0 0 0 0 3
6 4 5 0 0 0 0 0 0 1
7 1 1 4 0 0 0 0 0 1
10 3 3 4 3 5 5 5 0 1
11 5 4 0 0 0 0 0 0 1
13 0 4 0 0 0 0 0 0 1
15 1 3 5 0 0 0 0 0 1
16 4 0 0 0 0 0 0 0 1
17 3 3 4 4 0 0 0 0 1
我需要在 Pandas 数据框中找到重复的行,然后添加一个带有计数的额外列。假设我们有一个数据框:
>>print(df)
+----+-----+-----+-----+-----+-----+-----+-----+-----+
| | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|----+-----+-----+-----+-----+-----+-----+-----+-----|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2 | 4 | 3 | 4 | 1 | 1 | 4 | 4 |
| 3 | 4 | 3 | 4 | 0 | 0 | 0 | 0 | 0 |
| 4 | 2 | 3 | 4 | 3 | 4 | 0 | 0 | 0 |
| 5 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 4 | 5 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 1 | 1 | 4 | 0 | 0 | 0 | 0 | 0 |
| 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 4 | 3 | 4 | 0 | 0 | 0 | 0 | 0 |
| 10 | 3 | 3 | 4 | 3 | 5 | 5 | 5 | 0 |
| 11 | 5 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |
| 12 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 13 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |
| 14 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 15 | 1 | 3 | 5 | 0 | 0 | 0 | 0 | 0 |
| 16 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 17 | 3 | 3 | 4 | 4 | 0 | 0 | 0 | 0 |
| 18 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
+----+-----+-----+-----+-----+-----+-----+-----+-----+
上面的框架将变成下面的框架,并带有一个带有计数的附加列。你可以看到我们仍然保留了索引列。
+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|----+-----+-----+-----+-----+-----+-----+-----+-----|-----|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 2 | 2 | 4 | 3 | 4 | 1 | 1 | 4 | 4 | 1 |
| 3 | 4 | 3 | 4 | 0 | 0 | 0 | 0 | 0 | 2 |
| 4 | 2 | 3 | 4 | 3 | 4 | 0 | 0 | 0 | 1 |
| 5 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
| 6 | 4 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 7 | 1 | 1 | 4 | 0 | 0 | 0 | 0 | 0 | 1 |
| 10 | 3 | 3 | 4 | 3 | 5 | 5 | 5 | 0 | 1 |
| 11 | 5 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 13 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 15 | 1 | 3 | 5 | 0 | 0 | 0 | 0 | 0 | 1 |
| 16 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 17 | 3 | 3 | 4 | 4 | 0 | 0 | 0 | 0 | 1 |
+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
我见过其他解决方案,例如:
df.groupby(list(df.columns.values)).size()
但是 returns 一个有间隙且没有初始索引的矩阵。
您可以通过 first
和 len
:
reset_index
first for convert index
to columns and then aggregate
此外,如果需要按所有列分组,则需要删除 index
列 difference
:
print (df.columns.difference(['index']))
Index(['2', '3', '4', '5', '6', '7', '8', '9'], dtype='object')
print (df.reset_index()
.groupby(df.columns.difference(['index']).tolist())['index']
.agg(['first', 'size'])
.reset_index()
.set_index(['first'])
.sort_index()
.rename_axis(None))
2 3 4 5 6 7 8 9 size
0 0 0 0 0 0 0 0 0 2
1 2 0 0 0 0 0 0 0 2
2 2 4 3 4 1 1 4 4 1
3 4 3 4 0 0 0 0 0 2
4 2 3 4 3 4 0 0 0 1
5 5 0 0 0 0 0 0 0 3
6 4 5 0 0 0 0 0 0 1
7 1 1 4 0 0 0 0 0 1
10 3 3 4 3 5 5 5 0 1
11 5 4 0 0 0 0 0 0 1
13 0 4 0 0 0 0 0 0 1
15 1 3 5 0 0 0 0 0 1
16 4 0 0 0 0 0 0 0 1
17 3 3 4 4 0 0 0 0 1
如有必要,添加下一列 10
需要 rename
:
#if necessary convert to str
last_col = str(df.columns.astype(int).max() + 1)
print (last_col)
10
print (df.reset_index()
.groupby(df.columns.difference(['index']).tolist())['index']
.agg(['first', 'size'])
.reset_index()
.set_index(['first'])
.sort_index()
.rename_axis(None)
.rename(columns={'size':last_col}))
2 3 4 5 6 7 8 9 10
0 0 0 0 0 0 0 0 0 2
1 2 0 0 0 0 0 0 0 2
2 2 4 3 4 1 1 4 4 1
3 4 3 4 0 0 0 0 0 2
4 2 3 4 3 4 0 0 0 1
5 5 0 0 0 0 0 0 0 3
6 4 5 0 0 0 0 0 0 1
7 1 1 4 0 0 0 0 0 1
10 3 3 4 3 5 5 5 0 1
11 5 4 0 0 0 0 0 0 1
13 0 4 0 0 0 0 0 0 1
15 1 3 5 0 0 0 0 0 1
16 4 0 0 0 0 0 0 0 1
17 3 3 4 4 0 0 0 0 1