Use pandas to cluster in a DataFrame
I need help with pandas and a table.
Here is the table:
Col1 Col2
A B
C B
D B
E F
G F
F A
Z Y
H Y
L P
I would like to build clusters from this table and get a new table such as:
Cluster Names
Cluster1 A
Cluster1 B
Cluster1 C
Cluster1 D
Cluster1 F
Cluster1 E
Cluster1 G
Cluster2 Z
Cluster2 Y
Cluster2 H
Cluster3 L
Cluster3 P
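As you can see, the letters A, B, C, D, E, F and G are all in Cluster1, because each of them shares a row with at least one other member of the cluster: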
`A` and `B` are in the same row (A and B form `Cluster1`)
`C` and `B` are in the same row (C joins `Cluster1`)
`D` and `B` are in the same row (D joins `Cluster1`)
`F` and `A` are in the same row (F joins `Cluster1`)
`E` and `F` are in the same row (E joins `Cluster1`)
`G` and `F` are in the same row (G joins `Cluster1`)
`Z` and `Y` are in the same row (Z and Y form `Cluster2`)
`H` and `Y` are in the same row (H joins `Cluster2`)
`L` and `P` are in the same row (L and P form `Cluster3`)
Does anyone know how to do this with pandas?
This is a graph problem: you are looking for connected components. I suggest you use networkx.connected_components:
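import networkx as nx
# Treat each row of df as an undirected edge between Col1 and Col2.
g = nx.from_pandas_edgelist(df, source='Col1', target='Col2', create_using=nx.Graph)
# Each connected component of the graph is one cluster.
for component in nx.connected_components(g):
    print(component)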
Output (sets are unordered, so the order of the elements may differ on your machine)
{'E', 'G', 'C', 'D', 'F', 'A', 'B'}
{'Y', 'H', 'Z'}
{'L', 'P'}
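To make it clearer what "connected components" means here, the same grouping can be sketched without networkx using a small union-find (disjoint-set) structure. This is only an illustration under assumptions: the edge list is copied by hand from the question's table, and the names `parent`, `find` and `union` are hypothetical helpers, not part of pandas or networkx.

```python
# Edge list copied from the question's table (Col1, Col2).
edges = [("A", "B"), ("C", "B"), ("D", "B"), ("E", "F"), ("G", "F"),
         ("F", "A"), ("Z", "Y"), ("H", "Y"), ("L", "P")]

parent = {}  # maps each name to its parent in the disjoint-set forest


def find(x):
    """Return the representative of x's set, shortening the path on the way."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x


def union(a, b):
    """Merge the sets that contain a and b."""
    root_a, root_b = find(a), find(b)
    if root_a != root_b:
        parent[root_b] = root_a


# Every row says "these two names belong together".
for a, b in edges:
    union(a, b)

# Group names by representative, then sort for a stable display order.
groups = {}
for name in parent:
    groups.setdefault(find(name), []).append(name)
components = sorted(sorted(members) for members in groups.values())

for component in components:
    print(component)
```

Each printed list is one cluster. For real data the networkx call above is shorter and already tested, so this version is mainly useful for understanding the idea or when adding a dependency is not an option.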
Note that the components match the groups in your expected output. To turn them into a DataFrame, do the following:
import pandas as pd
# One [cluster label, name] row per element; components are numbered from 1.
data = [[f'Cluster{i}', element] for i, component in enumerate(nx.connected_components(g), 1) for element in component]
result = pd.DataFrame(data=data, columns=['Cluster', 'Names'])
print(result)
Output (the row order inside each cluster may differ, again because sets are unordered)
Cluster Names
0 Cluster1 D
1 Cluster1 A
2 Cluster1 B
3 Cluster1 G
4 Cluster1 C
5 Cluster1 F
6 Cluster1 E
7 Cluster2 Z
8 Cluster2 Y
9 Cluster2 H
10 Cluster3 L
11 Cluster3 P
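If you want the same result on every run, one option is to sort the names inside each component and then sort the components before numbering them. The sketch below is self-contained and rebuilds the question's table itself. It assumes the column names `Col1` and `Col2` from the question, and that numbering clusters by their alphabetically smallest member is acceptable, which is my choice rather than something the question asks for.

```python
import networkx as nx
import pandas as pd

# Rebuild the question's table; the answer above assumes `df` already exists.
df = pd.DataFrame({
    "Col1": list("ACDEGFZHL"),
    "Col2": list("BBBFFAYYP"),
})

# Each row becomes an undirected edge between Col1 and Col2.
g = nx.from_pandas_edgelist(df, source="Col1", target="Col2")

# Sort names inside each component, then sort the components themselves,
# so cluster numbers and row order do not depend on set iteration order.
components = sorted(sorted(c) for c in nx.connected_components(g))

result = pd.DataFrame(
    [(f"Cluster{i}", name)
     for i, component in enumerate(components, 1)
     for name in component],
    columns=["Cluster", "Names"],
)
print(result)
```

With the sample data this puts A to G in Cluster1, then H, Y, Z in Cluster2, then L, P in Cluster3, listed alphabetically within each cluster.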