Grouping by similar lists in a column within a dataframe

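I have a dataframe with a column of lists. I want to group the rows that have similar lists, regardless of the order of the items in each list. Each list can appear multiple times in the column. I would like the grouped lists to be sorted by how many times they appear in the column.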

import pandas as pd

data = [['a', ['tiger', 'cat', 'lion']], ['b', ['dolphin', 'goldfish', 'shark']], ['c', ['lion', 'cat', 'tiger']], ['d', ['bee', 'cat', 'tiger']],
        ['e', ['cat', 'lion', 'tiger']],  ['f', ['cat', 'bee', 'tiger']], ['g', ['shark', 'goldfish', 'dolphin']]]
df = pd.DataFrame(data)
df.columns = ['ID', 'animals']
df
   ID   animals
0   a   [tiger, cat, lion]
1   b   [dolphin, goldfish, shark]
2   c   [lion, cat, tiger]
3   d   [bee, cat, tiger]
4   e   [cat, lion, tiger]
5   f   [cat, bee, tiger]
6   g   [shark, goldfish, dolphin]

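I want to group the similar lists in the dataframe above. The order of the animals within a list can differ. I am currently using the following code to do this: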

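import collections as cs

# Quadratic approach: compare every row's list with every other row's list.
matches = []
for q in range(len(df)):
    for r in range(len(df)):
        if cs.Counter(df.iloc[q]['animals']) == cs.Counter(df.iloc[r]['animals']):
            matches.append(df.iloc[[r]])

# DataFrame.append was removed in pandas 2.0, so collect the rows and concatenate once.
animals_grouped = pd.concat(matches, ignore_index=True)
animals_grouped = animals_grouped.drop_duplicates('ID').reset_index(drop=True)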

Result:

animals_grouped

    ID  animals
0   a   [tiger, cat, lion]
1   c   [lion, cat, tiger]
2   e   [cat, lion, tiger]
3   b   [dolphin, goldfish, shark]
4   g   [shark, goldfish, dolphin]
5   d   [bee, cat, tiger]
6   f   [cat, bee, tiger]

Given that my original dataframe has more than 100,000 rows, what is an alternative to this nested for loop?

data = [['a', ['tiger', 'cat', 'lion']], ['b', ['dolphin', 'goldfish', 'shark']], ['c', ['lion', 'cat', 'tiger']], ['d', ['bee', 'cat', 'tiger']],\
       ['e', ['cat', 'lion', 'tiger']],  ['f', ['cat', 'bee', 'tiger']], ['g', ['shark', 'goldfish', 'dolphin']]]
df = pd.DataFrame(data)
df.columns = ['ID', 'animals']
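# Build an order-insensitive key for each list by joining its sorted items.
df1 = df.assign(temp=df.animals.apply(lambda x: ''.join(sorted(x))))
# Count the rows per key, put the largest groups first, then drop the helper columns.
df = (df1.assign(temp2=df1.groupby('temp')['temp'].transform('count'))
      .sort_values(['temp2', 'temp'], ascending=False)
      .drop(columns=['temp', 'temp2']))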

Output:

  ID                     animals
0  a          [tiger, cat, lion]
2  c          [lion, cat, tiger]
4  e          [cat, lion, tiger]
1  b  [dolphin, goldfish, shark]
6  g  [shark, goldfish, dolphin]
3  d           [bee, cat, tiger]
5  f           [cat, bee, tiger]
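
One caveat with a key built by ''.join: two different lists can in principle produce the same string (for example ['ab', 'c'] and ['a', 'bc'] both give 'abc'). The following is a minimal sketch of the same idea with a tuple key instead, assuming the list items are hashable and mutually comparable. The names keys, counts and order are illustrative only, and the ordering is done in plain Python rather than with pandas sorting. Groups of equal size come out in ascending key order here, so ties are ordered differently from the output above.

```python
from collections import Counter

import pandas as pd

data = [['a', ['tiger', 'cat', 'lion']], ['b', ['dolphin', 'goldfish', 'shark']],
        ['c', ['lion', 'cat', 'tiger']], ['d', ['bee', 'cat', 'tiger']],
        ['e', ['cat', 'lion', 'tiger']], ['f', ['cat', 'bee', 'tiger']],
        ['g', ['shark', 'goldfish', 'dolphin']]]
df = pd.DataFrame(data, columns=['ID', 'animals'])

# One order-insensitive, hashable key per row (assumed helper, not a pandas feature).
keys = [tuple(sorted(items)) for items in df['animals']]
# Number of rows that share each key.
counts = Counter(keys)

# Row positions ordered by group size (largest first), then by key.
# sorted() is stable, so rows keep their original order inside a group.
order = sorted(range(len(df)), key=lambda i: (-counts[keys[i]], keys[i]))
result = df.iloc[order]
print(result)
```

This is a single O(n log n) sort plus one pass to build the keys, so it avoids the pairwise comparison of the nested loop.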

You can create a temporary sort key by sorting each list, sort the df by that key, and then drop it.

(
    df.assign(sort_key = df.animals.apply(sorted))
    .sort_values('sort_key')
    .drop('sort_key', axis=1)
)

    ID  animals
0   a   [cat, lion, tiger]
2   c   [cat, lion, tiger]
1   b   [dolphin, goldfish, shark]
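
If the groups themselves are needed rather than a reordered frame, the same sorted-list key can be used to collect the IDs per group. This is only a sketch under the assumption that list items are hashable and comparable; groups and ranked are hypothetical names introduced for illustration.

```python
from collections import defaultdict

import pandas as pd

data = [['a', ['tiger', 'cat', 'lion']], ['b', ['dolphin', 'goldfish', 'shark']],
        ['c', ['lion', 'cat', 'tiger']], ['d', ['bee', 'cat', 'tiger']],
        ['e', ['cat', 'lion', 'tiger']], ['f', ['cat', 'bee', 'tiger']],
        ['g', ['shark', 'goldfish', 'dolphin']]]
df = pd.DataFrame(data, columns=['ID', 'animals'])

# Map each order-insensitive key to the IDs of the rows that share it.
groups = defaultdict(list)
for row_id, items in zip(df['ID'], df['animals']):
    groups[tuple(sorted(items))].append(row_id)

# Largest groups first; equal-sized groups stay in first-seen order (stable sort).
ranked = sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True)
for key, ids in ranked:
    print(len(ids), list(key), ids)
```

Each row is visited once, so building the groups is linear in the number of rows.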