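How to find duplicates from a full data frame using pandas?
I have a DataFrame with 3 columns, one per class, and each column has 5 rows of students. Some students appear in more than one class. I want to list the student names that appear most often across all classes, in descending order, together with the number of classes each one appears in and which classes those are.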
df = pd.DataFrame({
    'biology': ['ryan', 'sarah', 'tom', 'ed', 'jackson'],
    'statistics': ['sarah', 'ed', 'jacob', 'ryan', 'de'],
    'ecology': ['austin', 'ryan', 'tom', 'sam', 'sarah']
})

   biology  statistics  ecology
0     ryan       sarah   austin
1    sarah          ed     ryan
2      tom       jacob      tom
3       ed        ryan      sam
4  jackson          de    sarah
I would like the output to look like this:
ryan, 3 classes, (biology, statistics, ecology)
sarah, 3 classes, (biology, statistics, ecology)
tom, 2 classes, (biology, ecology)
ed, 2 classes, (biology, statistics)
jackson, 1 class, (biology)
jacob, 1 class, (statistics)
de, 1 class, (statistics)
austin, 1 class, (ecology)
...and so on.
Any help would be appreciated. I'm a beginner and have been at this for hours, and my brain is fried. Thanks!
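import pandas as pd

df = pd.DataFrame({
    'biology': ['ryan', 'sarah', 'tom', 'ed', 'jackson'],
    'statistics': ['sarah', 'ed', 'jacob', 'ryan', 'de'],
    'ecology': ['austin', 'ryan', 'tom', 'sam', 'sarah']
})

# Map each student to a running count and the list of classes they appear in
results = {}
for h in df:  # h is the column (class) name
    # k is the student name, v is how often it occurs in this class
    for k, v in df[h].value_counts().items():
        print(k, v)  # debug output
        if k in results:
            results[k]['value'] += v
            results[k]['class'].append(h)
        else:
            results[k] = {'value': v, 'class': [h]}

# Rebuild the dict with students ordered by total count, highest first
results = {h: results[h] for h in sorted(results, key=lambda x: results[x]['value'], reverse=True)}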
We can melt the DataFrame to get it into long form, then groupby aggregate with Named Aggregation to get both the number of classes and the names of the classes. Lastly, we can sort_values to get the most frequent students first:
output_df = (
    df.melt(var_name='class name', value_name='student name')
      .groupby('student name', as_index=False)
      .agg(class_count=('class name', 'count'),
           classes=('class name', tuple))
      .sort_values('class_count', ascending=False, ignore_index=True)
)

output_df:

   student name  class_count                         classes
0          ryan            3  (biology, statistics, ecology)
1         sarah            3  (biology, statistics, ecology)
2            ed            2           (biology, statistics)
3           tom            2              (biology, ecology)
4        austin            1                      (ecology,)
5            de            1                   (statistics,)
6       jackson            1                      (biology,)
7         jacob            1                   (statistics,)
8           sam            1                      (ecology,)
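To make the melt step more concrete, here is a minimal sketch of the intermediate long-form frame that the groupby operates on. The name `long_df` is only used for illustration and is not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    'biology': ['ryan', 'sarah', 'tom', 'ed', 'jackson'],
    'statistics': ['sarah', 'ed', 'jacob', 'ryan', 'de'],
    'ecology': ['austin', 'ryan', 'tom', 'sam', 'sarah']
})

# Wide to long: each (class, student) pair becomes its own row.
# long_df is an illustrative name, not one used in the answer.
long_df = df.melt(var_name='class name', value_name='student name')

print(long_df.shape)    # 3 columns x 5 rows -> 15 rows, 2 columns
print(long_df.head())   # the first rows all come from the 'biology' column
```

Once every student sits in a single column, counting and collecting the classes per student is an ordinary groupby on 'student name'.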
We can further conditionally append classes/class to class_count, then write the result out with to_csv:
# Conditionally Add Classes/Class
output_df['class_count'] = output_df['class_count'].astype(str) + np.where(
    output_df['class_count'].eq(1),
    ' class',
    ' classes'
)
# Write to CSV (no index column, no header row)
output_df.to_csv('output.csv', index=False, header=False)
output.csv:
ryan,3 classes,"('biology', 'statistics', 'ecology')"
sarah,3 classes,"('biology', 'statistics', 'ecology')"
ed,2 classes,"('biology', 'statistics')"
tom,2 classes,"('biology', 'ecology')"
austin,1 class,"('ecology',)"
de,1 class,"('statistics',)"
jackson,1 class,"('biology',)"
jacob,1 class,"('statistics',)"
sam,1 class,"('ecology',)"
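The CSV above quotes the tuple column and keeps Python's tuple syntax. If the goal is plain lines exactly like the ones in the question, one possible sketch is to build the strings by hand instead of using to_csv. The names `summary`, `ordered` and `lines` are illustrative assumptions, and the sketch assumes that no student is listed twice within the same class.

```python
import pandas as pd

df = pd.DataFrame({
    'biology': ['ryan', 'sarah', 'tom', 'ed', 'jackson'],
    'statistics': ['sarah', 'ed', 'jacob', 'ryan', 'de'],
    'ecology': ['austin', 'ryan', 'tom', 'sam', 'sarah']
})

# One list of class names per student (groupby sorts students alphabetically)
summary = (
    df.melt(var_name='class name', value_name='student name')
      .groupby('student name')['class name']
      .agg(list)
)

# Python's sort is stable, so ties keep their alphabetical order
ordered = sorted(summary.items(), key=lambda item: len(item[1]), reverse=True)

lines = []
for student, classes in ordered:
    noun = 'class' if len(classes) == 1 else 'classes'
    lines.append(f"{student}, {len(classes)} {noun}, ({', '.join(classes)})")

print('\n'.join(lines))
```

The resulting lines can be written to a text file with ordinary file I/O if a file is needed.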
Setup and imports:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'biology': ['ryan', 'sarah', 'tom', 'ed', 'jackson'],
    'statistics': ['sarah', 'ed', 'jacob', 'ryan', 'de'],
    'ecology': ['austin', 'ryan', 'tom', 'sam', 'sarah']
})