sklearn Hierarchical Agglomerative Clustering using similarity matrix

Question

Given a distance matrix, with similarity between various professors:

              prof1     prof2     prof3
       prof1     0        0.8     0.9
       prof2     0.8      0       0.2
       prof3     0.9      0.2     0

I need to perform hierarchical clustering on this data, which is in the form of a 2-D matrix:

       data_matrix=[[0,0.8,0.9],[0.8,0,0.2],[0.9,0.2,0]]

I checked whether I could do this with sklearn.cluster AgglomerativeClustering, but it treats the 3 rows as 3 separate vectors rather than as a distance matrix. Can it be done with this or with scipy.cluster.hierarchy?

Answer 1

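Yes, you can do it with sklearn. You need to set:

affinity='precomputed', to use a matrix of distances (in scikit-learn 1.2 and later this parameter is named metric, and affinity was removed in 1.4)
linkage='complete' or 'average', because the default linkage (Ward) works only on coordinate input

With precomputed affinity, the input matrix is interpreted as a matrix of distances between observations. The following code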

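from sklearn.cluster import AgglomerativeClustering

# Pairwise distances between the three professors
data_matrix = [[0, 0.8, 0.9], [0.8, 0, 0.2], [0.9, 0.2, 0]]
# On scikit-learn older than 1.2, pass affinity='precomputed' instead of metric
model = AgglomerativeClustering(metric='precomputed', n_clusters=2, linkage='complete').fit(data_matrix)
print(model.labels_)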

will return the labels [1 0 0]: the first professor goes to one cluster, and the second and third to another.

Answer 2

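The input data_matrix here must be a distance matrix, which is not the same thing as the similarity matrix named in the question: the two measure opposite things, and using one in place of the other would produce quite arbitrary results. See the official documentation ('If "precomputed", a distance matrix (instead of a similarity matrix) is needed as input for the fit method.'): https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html

As a workaround, you can convert with distance = 1 - similarity (assuming the similarity values are normalized between 0 and 1) and use the result as input.

I have tried this on a few examples and verified it, so it should do the job.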
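The conversion described above can be sketched as follows. The similarity values are hypothetical (chosen as 1 minus the question's distances, purely for illustration), and make_model is a helper introduced here, not part of scikit-learn. The sketch assumes similarities in [0, 1] and tries the metric parameter first, falling back to the older affinity name on scikit-learn versions before 1.2.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical similarity matrix (1 = identical), values assumed to be in [0, 1].
similarity = np.array([[1.0, 0.2, 0.1],
                       [0.2, 1.0, 0.8],
                       [0.1, 0.8, 1.0]])

# Turn similarities into distances; the diagonal becomes 0.
distance = 1.0 - similarity


def make_model(**kwargs):
    """Hypothetical helper: build the model under either parameter name."""
    try:
        # scikit-learn 1.2 and later
        return AgglomerativeClustering(metric='precomputed', **kwargs)
    except TypeError:
        # older scikit-learn, where the parameter is called affinity
        return AgglomerativeClustering(affinity='precomputed', **kwargs)


model = make_model(n_clusters=2, linkage='average').fit(distance)
labels = model.labels_
print(labels)
```

The label numbers themselves are arbitrary; what matters is that prof2 and prof3, the most similar pair, share a label and prof1 gets the other one.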

Answer 3

You can also use scipy.cluster.hierarchy:

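from scipy.cluster.hierarchy import dendrogram, linkage, cut_tree
from scipy.spatial.distance import squareform
from matplotlib import pyplot as plt

# Data: square matrix of pairwise distances
X = [[0, 0.8, 0.9], [0.8, 0, 0.2], [0.9, 0.2, 0]]
labels = ['prof1', 'prof2', 'prof3']

# linkage() treats a 2-D input as observation vectors, so convert the
# square distance matrix to condensed form first
condensed = squareform(X)

# Perform clustering, you can choose the method
# in this case, we use 'complete' ('ward' assumes Euclidean distances)
Z = linkage(condensed, 'complete')

# Extract the membership to a cluster, either specify the n_clusters
# or the cut height
# (similar to sklearn labels)
print(cut_tree(Z, n_clusters=2))

# Visualize the clustering as a dendrogram
fig = plt.figure(figsize=(25, 10))
dn = dendrogram(Z, orientation='right', labels=labels)
plt.show()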

This will print:

[[0]
 [1]
 [1]]

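Because we specified n_clusters=2, there are 2 clusters. prof1 belongs to cluster 0, while prof2 and prof3 belong to cluster 1. You can also give a cut height (the height argument of cut_tree) instead of the number of clusters. The dendrogram looks like this:

![Dendrogram][1] https://imgur.com/EF0cW4U.png "Dendrogram"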
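Cutting by height rather than by cluster count might look like the sketch below. It assumes the question's square distance matrix, converted to condensed form with squareform because linkage would otherwise read the rows as observation vectors. The average linkage and the 0.5 threshold are arbitrary choices for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cut_tree, fcluster
from scipy.spatial.distance import squareform

# Square distance matrix from the question
X = np.array([[0.0, 0.8, 0.9],
              [0.8, 0.0, 0.2],
              [0.9, 0.2, 0.0]])

# Condensed (upper-triangle) form expected by linkage for precomputed distances
Z = linkage(squareform(X), method='average')

# prof2 and prof3 merge at 0.2; prof1 joins at (0.8 + 0.9) / 2 = 0.85,
# so a cut at 0.5 (arbitrary threshold) leaves two clusters
by_height = cut_tree(Z, height=0.5).ravel()
print(by_height)

# fcluster offers the same kind of cut and returns a flat label array
flat = fcluster(Z, t=0.5, criterion='distance')
print(flat)
```

Both calls group prof2 with prof3 and leave prof1 alone; only the label numbering may differ between them.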
