Group dataframe rows with common values

Question

I have the following dataset:

    0   1   2   3
0   a   ❤     
1   b   ❤     
2   c       
3   d     ✨   
4   e   ❤

I want to perform clustering to group the ROWS that have something in common.

Using networkx with the following code, I get the result shown below:

import networkx as nx
import matplotlib.pyplot as plt

G=nx.from_pandas_edgelist(df, 0, 1)
nx.draw(G, with_labels=True)
plt.show()

output: groups obtained with networkx

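How can I also take columns 2 and 3 into account? And can I do this without giving priority to any particular column (for example, I want column 2 to be just as important as column 1)?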

Answer 1

Along similar lines, you could treat each dataframe row as a path and look for the connected components. I've added a row that shares no values with any other row, to better illustrate how this works:

print(df)
   0  1   2    3
0  a  ❤    
1  b  ❤    
2  c      
3  d    ✨  
4  e  ❤    
5  f
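As a quick illustration of the idea, here is a minimal sketch on made-up values (the rows below are hypothetical, not the data above). `nx.add_path` links each value to the next one in the row, so two rows that share any value end up in the same connected component:

```python
import networkx as nx

G = nx.Graph()
# Two illustrative rows (hypothetical values) that share the value "x"
nx.add_path(G, ["a", "x", "p"])   # adds edges a-x and x-p
nx.add_path(G, ["b", "x", "q"])   # adds edges b-x and x-q
# A third row with nothing in common with the other two
nx.add_path(G, ["c", "y", "r"])

# Each component is a set of cell values reachable from one another
components = list(nx.connected_components(G))
print(components)
```

The first two rows fall into one component because both contain "x", while the third row forms a component of its own.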

So iterate over the dataframe rows and add each one as a path with nx.add_path:

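import networkx as nx

# Each row becomes a path whose nodes are the row's cell values
my_list = df.values.tolist()
G = nx.Graph()
for path in my_list:
    nx.add_path(G, path)
# Rows that share any value end up in the same connected component
components = list(nx.connected_components(G))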

print(components)
[{'a', 'b', 'c', 'd', 'e', '✨', '❤', '', '', '', '', ''},
 {'f', '', '', ''}]
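One thing to watch for is how blank cells are stored. If they are all the same empty string (or NaN), that value becomes a node shared by every row and would pull all rows into a single component. The sketch below is one way to avoid this, assuming blanks are empty strings or NaN and should not count as a common value. The frame is a hypothetical reconstruction, so its grouping need not match the output printed above:

```python
import networkx as nx
import pandas as pd

# Hypothetical reconstruction; blanks are assumed to be empty strings
df = pd.DataFrame([
    ["a", "❤", "", ""],
    ["b", "❤", "", ""],
    ["c", "", "", ""],
    ["d", "", "✨", ""],
    ["e", "❤", "", ""],
    ["f", "", "", ""],
])

G = nx.Graph()
for row in df.values.tolist():
    # Keep only cells that carry a real value
    cells = [c for c in row if pd.notna(c) and c != ""]
    # Register the nodes explicitly so rows with a single value are kept
    G.add_nodes_from(cells)
    nx.add_path(G, cells)

components = list(nx.connected_components(G))
print(components)
```

Under this assumption, rows a, b and e are linked through ❤, while c, d and f each form their own component, because blank cells no longer act as a link.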

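Now you can iterate over the components and, for each one, collect every row whose values are a subset of that component into a new sublist of a nested list: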

groups = []
for component in components:
    group = []
    for path in my_list:
        if component.issuperset(path):
            group.append(path)
    groups.append(group)

In this case, all rows except the last one are grouped together, and the last row ends up in a group of its own.

print(groups)

[[['a', '❤', '', ''],
  ['b', '❤', '', ''],
  ['c', '', '', ''],
  ['d', '', '✨', ''],
  ['e', '❤', '', '']],
 [['f', '', '', '']]]
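If a label per row is more convenient than a nested list, the components can also be mapped back onto the dataframe. This is a sketch on a hypothetical toy frame; the column name `group` and the helper dict `node_to_group` are illustrative names, not part of the original answer:

```python
import networkx as nx
import pandas as pd

# Hypothetical toy frame; the values are illustrative, not the original data
df = pd.DataFrame([
    ["a", "x", "p"],
    ["b", "x", "q"],
    ["c", "y", "q"],
    ["d", "z", "r"],
])

G = nx.Graph()
for row in df.values.tolist():
    nx.add_path(G, row)

# Map every node (cell value) to the index of its connected component
node_to_group = {
    node: i
    for i, component in enumerate(nx.connected_components(G))
    for node in component
}

# All cells of a row lie in the same component, so the first cell is enough
df["group"] = [node_to_group[value] for value in df[0]]

# Collect the column-0 values of each group
grouped = {label: sub[0].tolist() for label, sub in df.groupby("group")}
print(grouped)
```

Here rows a and b are linked through "x", and b and c through "q", so a, b and c share one label while d gets its own. Note that a value is treated as the same node regardless of which column it appears in, so no column is given priority.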

