Use pandas to cluster in a dataframe

Question

I need help with pandas. Here is a table:

Col1    Col2
A   B
C   B
D   B
E   F
G   F
F   A
Z   Y
H   Y
L   P

I would like to create clusters from this table and get a new table such as:

Cluster Names
Cluster1    A
Cluster1    B
Cluster1    C
Cluster1    D
Cluster1    F
Cluster1    E
Cluster1    G
Cluster2    Z
Cluster2    Y
Cluster2    H
Cluster3    L
Cluster3    P

As you can see, the letters A, B, C, D, E, F and G are in Cluster1 because each of them shares a row with at least one other member of the cluster.

`A` and `B` are in the same row (A and B form `Cluster1`)
`C` and `B` are in the same row (C joins `Cluster1`)
`D` and `B` are in the same row (D joins `Cluster1`)
`F` and `A` are in the same row (F joins `Cluster1`)
`E` and `F` are in the same row (E joins `Cluster1`)
`G` and `F` are in the same row (G joins `Cluster1`)

`Z` and `Y` are in the same row (Z and Y form `Cluster2`)
`H` and `Y` are in the same row (H joins `Cluster2`)

`L` and `P` are in the same row (L and P form `Cluster3`)

Does anyone know how to do this with pandas?

Answer 1

This is a graph problem: finding connected components. I suggest you use networkx.connected_components:

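import networkx as nx
import pandas as pd

# df is the two-column DataFrame from the question (columns Col1 and Col2).
# Treat each row as an undirected edge between the two names.
g = nx.from_pandas_edgelist(df, source='Col1', target='Col2', create_using=nx.Graph)

# Each connected component is one cluster.
for component in nx.connected_components(g):
    print(component)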

Output

{'E', 'G', 'C', 'D', 'F', 'A', 'B'}
{'Y', 'H', 'Z'}
{'L', 'P'}
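
If you want to run this end to end, here is a minimal sketch that rebuilds the table from the question first. The column values are copied from the question; the `components` list and the sorting step are my own additions, there only to make the printout easier to compare, since sets have no guaranteed display order.

```python
import networkx as nx
import pandas as pd

# Rebuild the table from the question as a two-column DataFrame.
df = pd.DataFrame({
    "Col1": list("ACDEGFZHL"),
    "Col2": list("BBBFFAYYP"),
})

# Each row becomes an undirected edge between Col1 and Col2.
g = nx.from_pandas_edgelist(df, source="Col1", target="Col2")

# Sort the names inside each component (an addition for readable output).
components = [sorted(c) for c in nx.connected_components(g)]
for names in components:
    print(names)
```

Each printed list should hold the same letters as one of the sets above.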

Note that the components match the groups in the expected output. To convert them into a DataFrame, do the following:

# Number the components from 1 and emit one [cluster, name] row per member.
data = [[f'Cluster{i}', element] for i, component in enumerate(nx.connected_components(g), 1) for element in component]

result = pd.DataFrame(data=data, columns=['Cluster', 'Names'])
print(result)

Output

     Cluster Names
0   Cluster1     D
1   Cluster1     A
2   Cluster1     B
3   Cluster1     G
4   Cluster1     C
5   Cluster1     F
6   Cluster1     E
7   Cluster2     Z
8   Cluster2     Y
9   Cluster2     H
10  Cluster3     L
11  Cluster3     P
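
If you would rather not add networkx as a dependency, the same grouping can be sketched with a small union-find (disjoint-set) structure in plain Python plus pandas. This is an alternative technique, not what the code above does; the helper names `find` and `union`, the `parent` dict and the temporary `label` column are all hypothetical names chosen for this sketch.

```python
import pandas as pd

# Same table as in the question.
df = pd.DataFrame({
    "Col1": list("ACDEGFZHL"),
    "Col2": list("BBBFFAYYP"),
})

parent = {}  # maps each name to its parent in the union-find forest


def find(x):
    """Return the representative of x's group, shortening the path on the way."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x


def union(a, b):
    """Merge the groups containing a and b."""
    ra, rb = find(a), find(b)
    if ra != rb:
        parent[rb] = ra


# Every row links its two names into the same group.
for a, b in zip(df["Col1"], df["Col2"]):
    union(a, b)

# Number the groups from 1 in order of first appearance.
names = pd.Series(list(parent))
labels = pd.factorize(names.map(find))[0] + 1

result = (
    pd.DataFrame({"label": labels, "Names": names})
    .sort_values(["label", "Names"], ignore_index=True)
)
result.insert(0, "Cluster", "Cluster" + result["label"].astype(str))
result = result.drop(columns="label")
print(result)
```

Sorting on the numeric label rather than on the `Cluster` string keeps Cluster10 from landing before Cluster2 when there are many groups.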
