Grouping by similar lists in a column within a dataframe
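I have a dataframe with a column of lists.
I want to group rows that have similar lists (the same items), regardless of the order of the items within each list. Each list can occur several times in the column. I would like the grouped lists ordered by how many times each one occurs in the column.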
import pandas as pd

data = [['a', ['tiger', 'cat', 'lion']], ['b', ['dolphin', 'goldfish', 'shark']],
        ['c', ['lion', 'cat', 'tiger']], ['d', ['bee', 'cat', 'tiger']],
        ['e', ['cat', 'lion', 'tiger']], ['f', ['cat', 'bee', 'tiger']],
        ['g', ['shark', 'goldfish', 'dolphin']]]
df = pd.DataFrame(data)
df.columns = ['ID', 'animals']
df
  ID                     animals
0  a          [tiger, cat, lion]
1  b  [dolphin, goldfish, shark]
2  c          [lion, cat, tiger]
3  d           [bee, cat, tiger]
4  e          [cat, lion, tiger]
5  f           [cat, bee, tiger]
6  g  [shark, goldfish, dolphin]
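I want to group the similar lists in the dataframe above. The order of the animals within a list can differ.
I am currently using the following code to do this: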
import collections as cs

animals_grouped = pd.DataFrame()
for q in range(len(df)):
    for r in range(len(df)):
        # Counter equality compares list contents regardless of item order
        if cs.Counter(df.iloc[q]['animals']) == cs.Counter(df.iloc[r]['animals']):
            # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
            animals_grouped = pd.concat([animals_grouped, df.iloc[[r]]], ignore_index=True)
animals_grouped = animals_grouped.drop_duplicates('ID').reset_index(drop=True)
Result:
animals_grouped
  ID                     animals
0  a          [tiger, cat, lion]
1  c          [lion, cat, tiger]
2  e          [cat, lion, tiger]
3  b  [dolphin, goldfish, shark]
4  g  [shark, goldfish, dolphin]
5  d           [bee, cat, tiger]
6  f           [cat, bee, tiger]
Given that my original dataframe has more than 100,000 rows, what is an alternative to this nested for loop?
data = [['a', ['tiger', 'cat', 'lion']], ['b', ['dolphin', 'goldfish', 'shark']], ['c', ['lion', 'cat', 'tiger']],
        ['d', ['bee', 'cat', 'tiger']], ['e', ['cat', 'lion', 'tiger']], ['f', ['cat', 'bee', 'tiger']],
        ['g', ['shark', 'goldfish', 'dolphin']]]
df = pd.DataFrame(data)
df.columns = ['ID', 'animals']
# Build an order-insensitive key by joining each list's sorted items into one string
df1 = df.assign(temp=df.animals.apply(lambda x: ''.join(sorted(x))))
# Count rows per key, sort by count then key (both descending), then drop the helper columns
df = (df1.assign(temp2=df1.groupby(df1['temp'].values)['temp'].transform('count'))
         .sort_values(['temp2', 'temp'], ascending=False)
         .drop(['temp', 'temp2'], axis=1))
Output:
  ID                     animals
0  a          [tiger, cat, lion]
2  c          [lion, cat, tiger]
4  e          [cat, lion, tiger]
1  b  [dolphin, goldfish, shark]
6  g  [shark, goldfish, dolphin]
3  d           [bee, cat, tiger]
5  f           [cat, bee, tiger]
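One caveat with a key built by `''.join(...)`: different lists can produce the same string (for example `['ab', 'c']` and `['a', 'bc']` both give `'abc'`). A minimal sketch of the same idea with a tuple key follows. The function name `group_similar_lists` and the first-seen tie-break between equal-sized groups are assumptions made for illustration, not part of the answer above.

```python
from collections import Counter

import pandas as pd


def group_similar_lists(df, column):
    """Reorder rows so lists with the same items sit together, largest groups first."""
    # tuple(sorted(...)) is hashable, ignores item order, and keeps duplicate items.
    keys = [tuple(sorted(items)) for items in df[column]]
    sizes = Counter(keys)
    # Remember where each key first appears so equal-sized groups keep input order.
    first_seen = {}
    for pos, key in enumerate(keys):
        first_seen.setdefault(key, pos)
    # Sort row positions by: group size (descending), first appearance, row position.
    order = sorted(
        range(len(keys)),
        key=lambda i: (-sizes[keys[i]], first_seen[keys[i]], i),
    )
    return df.iloc[order]


data = [['a', ['tiger', 'cat', 'lion']], ['b', ['dolphin', 'goldfish', 'shark']],
        ['c', ['lion', 'cat', 'tiger']], ['d', ['bee', 'cat', 'tiger']],
        ['e', ['cat', 'lion', 'tiger']], ['f', ['cat', 'bee', 'tiger']],
        ['g', ['shark', 'goldfish', 'dolphin']]]
df = pd.DataFrame(data, columns=['ID', 'animals'])
result = group_similar_lists(df, 'animals')
print(result)
```

On the sample data this yields the rows in the order a, c, e, b, g, d, f. Building the keys and sorting is a single pass plus one sort, so it avoids the row-by-row comparison of the nested loop.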
You can create a temporary sort key by sorting each list, sort the df by that key, and then drop it.
(
    df.assign(sort_key=df.animals.apply(sorted))
    .sort_values('sort_key')
    .drop('sort_key', axis=1)
)
  ID                     animals
0  a          [cat, lion, tiger]
2  c          [cat, lion, tiger]
1  b  [dolphin, goldfish, shark]
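Sorting by the key places matching lists next to each other, but it does not report how often each list occurs. As a rough sketch, the groups can also be collected in one pass into a summary table ordered by count. The helper name `summarize_groups` and the output column names `members` and `count` are hypothetical; the key mirrors the `Counter` comparison used in the question.

```python
from collections import Counter

import pandas as pd


def summarize_groups(df, list_col, id_col):
    """Return one row per distinct list (ignoring item order), most frequent first."""
    groups = {}
    for row_id, items in zip(df[id_col], df[list_col]):
        # A frozenset of (item, occurrences) pairs is hashable and order-insensitive,
        # and two lists share a key exactly when their Counters are equal.
        key = frozenset(Counter(items).items())
        # Keep the first list seen as the representative of its group.
        entry = groups.setdefault(key, {list_col: items, "members": []})
        entry["members"].append(row_id)
    rows = [
        {list_col: g[list_col], "members": g["members"], "count": len(g["members"])}
        for g in groups.values()
    ]
    # list.sort is stable, so equal-sized groups stay in first-seen order.
    rows.sort(key=lambda r: r["count"], reverse=True)
    return pd.DataFrame(rows)


data = [['a', ['tiger', 'cat', 'lion']], ['b', ['dolphin', 'goldfish', 'shark']],
        ['c', ['lion', 'cat', 'tiger']], ['d', ['bee', 'cat', 'tiger']],
        ['e', ['cat', 'lion', 'tiger']], ['f', ['cat', 'bee', 'tiger']],
        ['g', ['shark', 'goldfish', 'dolphin']]]
df = pd.DataFrame(data, columns=['ID', 'animals'])
summary = summarize_groups(df, 'animals', 'ID')
print(summary)
```

Unlike a sorted key, this variant only needs the list items to be hashable, not mutually comparable.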