Filter a pandas DataFrame by removing duplicates from a column containing lists

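The DataFrame columns contain lists of string values. The DataFrame needs to be transformed so that it keeps only the rows whose list in the column 'Final' is unique.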

I have a DataFrame like the following:

    string1           string2           Final
1   [abc,ncx]       [qwe, rty]        [apple, mango]
2   [uio,pas,dfg]   [zxc,vbg,dfv]     [banana,grapes, apple]
3   [ncx,abc]       [rty,qwe]         [mango,apple]
4   [uio,pas,dfg]   [zxc,vbg,dfv]     [banana,grapes, apple]
5   [uio,dfg]        [zxc,dfv]        [banana, apple]
6   [ncx,abc]       [rty,qwe]         [mango,apple]

Duplicate lists must be removed from the df['Final'] column, so that the resulting DataFrame contains only unique lists in 'Final'.

Desired output DataFrame:

     string1           string2           Final
1   [abc,ncx]       [qwe, rty]        [apple, mango]
2   [uio,pas,dfg]   [zxc,vbg,dfv]     [banana,grapes, apple]
3   [ncx,abc]       [rty,qwe]         [mango,apple]
4   [uio,dfg]        [zxc,dfv]        [banana, apple]

Use Series.duplicated. Because lists are not hashable, first convert them to tuples, then filter with boolean indexing, using ~ to invert the mask:

df = df[~df['Final'].apply(tuple).duplicated()]
print(df)
         string1        string2                    Final
1      [abc,ncx]      [qwe,rty]           [apple, mango]
2  [uio,pas,dfg]  [zxc,vbg,dfv]  [banana, grapes, apple]
3      [ncx,abc]      [rty,qwe]           [mango, apple]
5      [uio,dfg]      [zxc,dfv]          [banana, apple]
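
For anyone who wants to reproduce this, here is a minimal self-contained sketch. It rebuilds the sample frame from the question (assuming the cells are Python lists of strings and the index runs from 1 to 6), applies the same tuple-based mask, and then renumbers the rows so the index matches the 1 to 4 numbering of the desired output. The names mask, unique_df and renumbered are only illustrative.

```python
import pandas as pd

# Sample data from the question; each cell is assumed to be a list of strings.
df = pd.DataFrame(
    {
        "string1": [["abc", "ncx"], ["uio", "pas", "dfg"], ["ncx", "abc"],
                    ["uio", "pas", "dfg"], ["uio", "dfg"], ["ncx", "abc"]],
        "string2": [["qwe", "rty"], ["zxc", "vbg", "dfv"], ["rty", "qwe"],
                    ["zxc", "vbg", "dfv"], ["zxc", "dfv"], ["rty", "qwe"]],
        "Final": [["apple", "mango"], ["banana", "grapes", "apple"],
                  ["mango", "apple"], ["banana", "grapes", "apple"],
                  ["banana", "apple"], ["mango", "apple"]],
    },
    index=range(1, 7),
)

# Lists are unhashable, so convert each one to a tuple before calling duplicated().
# The mask is True for every repeat of a list that has already been seen.
mask = df["Final"].apply(tuple).duplicated()

# Keep only the first occurrence of each list (order-sensitive comparison).
unique_df = df[~mask]

# Optional: renumber the surviving rows 1..n, as in the desired output.
renumbered = unique_df.reset_index(drop=True)
renumbered.index = range(1, len(renumbered) + 1)

print(renumbered)
```

Filtering keeps the original index labels (1, 2, 3, 5), so the renumbering step is only needed if the index should be consecutive again.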

If [apple, mango] should count as a duplicate of [mango, apple] (order does not matter), change tuple to frozenset:

df = df[~df['Final'].apply(frozenset).duplicated()]
print(df)
         string1        string2                    Final
1      [abc,ncx]      [qwe,rty]           [apple, mango]
2  [uio,pas,dfg]  [zxc,vbg,dfv]  [banana, grapes, apple]
5      [uio,dfg]      [zxc,dfv]          [banana, apple]
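
One thing to be aware of with frozenset: it ignores repeated elements inside a list as well as order, so [apple, apple, mango] would be treated as equal to [apple, mango]. If that matters, a sorted tuple is one possible alternative that ignores order but keeps repeats. The sketch below compares the two on a small made-up frame; the row with a repeated 'apple' is hypothetical and not part of the question's data.

```python
import pandas as pd

# Hypothetical sample: the last row repeats "apple" to show the difference.
df = pd.DataFrame(
    {
        "Final": [
            ["apple", "mango"],
            ["banana", "grapes", "apple"],
            ["mango", "apple"],
            ["apple", "apple", "mango"],
        ]
    }
)

# frozenset: ignores both order and repeated elements.
by_set = df[~df["Final"].apply(frozenset).duplicated()]

# Sorted tuple: ignores order, but repeated elements still count.
by_sorted = df[~df["Final"].apply(lambda xs: tuple(sorted(xs))).duplicated()]

print(by_set)
print(by_sorted)
```

With frozenset only the first two rows survive, while the sorted-tuple key also keeps the row with the repeated element.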