Scipy的cut_tree()没有return请求的簇数，用scipy和fastcluster得到的连接矩阵不匹配

Question

我正在进行凝聚层次聚类（AHC) experiment using the fastcluster package in connection with scipy.cluster.hierarchy module functions, in Python 3, and I found a puzzling behaviour of the cut_tree() 函数。

我使用 linkage_vector() 和 method=ward 毫无问题地聚类数据并得到一个链接矩阵 Z。然后，我想切割树状图树以获得固定数量的簇（例如 33），我使用 cut_tree(Z, n_clusters=33) 正确执行此操作。（请记住，AHC 是一种确定性方法，它会生成连接所有数据点的二叉树，这些数据点位于树的叶子上；您可以在任何级别查看这棵树，最终达到 "see" 您想要的簇数; cut_tree() 所做的只是 return 一组从 0 到 n_clusters - 1 的 'n_cluster' 整数标签，归因于数据集的每个点。）

我在其他实验中做过很多次，我总是得到我请求的集群数。问题是，对于这个数据集，当我向 cut_tree() 询问 33 个集群时，它只给了我 32 个。我不明白为什么会这样。这可能是一个错误吗？您是否知道 cut_tree() 的任何错误？我尝试调试此行为并使用 scipy 的 linkage() 函数执行相同的聚类实验。将生成的链接矩阵作为 cut_tree() 的输入，我没有得到意外数量的簇作为输出。我也验证了两种方法输出的连锁矩阵不相等

我正在使用的 [dataset] 包含 10680 个向量，每个向量有 20 个维度。检查以下实验：

import numpy as np
import fastcluster as fc
import scipy.cluster.hierarchy as hac
from scipy.spatial.distance import pdist

### *Load dataset (10680 vectors, each with 20 dimensions)*
X = np.load('dataset.npy')

### *Hierarchical clustering using traditional scipy method*
dists = pdist(X)
Z_1 = hac.linkage(dists, method='ward')

### *Hierarchical clustering using optimized fastcluster method*
Z_2 = fc.linkage_vector(X, method='ward')

### *Comparissons*

## Are the linkage matrices equal?
print("Z_1 == Z_2 ? ", np.allclose(Z_1, Z_2))

## Is scipy's cut_tree() returning the requested number of clusters when using Z_2?
print("Req.\tGot\tequal?")
for i in range(1,50):
    cut = hac.cut_tree(Z_2, i)
    uniq = len(np.unique(cut))
    print(i,"\t",uniq,"\t",i==uniq)

## The same as before, but in condensed form. When requesting cut_tree() for clusters
#  in the range [1,50] does it return wrong results at some point?
print("Any problem cutting Z_1 for n_clusters in [1,50]? ", not np.all([len(np.unique(
                                      hac.cut_tree(Z_1, i)))==i for i in range(1,50)]))
print("Any problem cutting Z_2 for n_clusters in [1,50]? ", not np.all([len(np.unique(
                                      hac.cut_tree(Z_2, i)))==i for i in range(1,50)]))

#Output:
#
#Z_1 == Z_2 ?  False
#
#Req.    Got     equal?
#1        1       True
#2        2       True
#3        3       True
#4        4       True
#5        5       True
#6        6       True
#7        7       True
#8        8       True
#9        9       True
#10       10      True
#11       11      True
#12       12      True
#13       13      True
#14       14      True
#15       15      True
#16       16      True
#17       17      True
#18       18      True
#19       19      True
#20       20      True
#21       21      True
#22       22      True
#23       23      True
#24       24      True
#25       25      True
#26       26      True
#27       27      True
#28       28      True
#29       29      True
#30       30      True
#31       31      True
#32       32      True
#33       32      False
#34       33      False
#35       34      False
#36       35      False
#37       36      False
#38       37      False
#39       38      False
#40       39      False
#41       40      False
#42       41      False
#43       42      False
#44       43      False
#45       44      False
#46       45      False
#47       46      False
#48       47      False
#49       48      False
#
#Any problem cutting Z_1 for n_clusters in [1,50]?  False
#Any problem cutting Z_2 for n_clusters in [1,50]?  True

您可能已经注意到数据集包含 37 个向量，至少有一个副本，并且计算所有副本，数据集中共有 55 个向量，至少有一个副本。

为了检查，我决定将树状图树绘制到两个链接矩阵的浅深度级别，您可以在下面的图像中看到（顶部的 Z_1 和 Z_2在底部）。括号内的数字表示该 b运行ch 中包含的人口；没有括号的数字是树的叶子（数字是 X 矩阵中向量的索引）。人们可以看到唯一的区别（在绘图级别）是在标有红色方块的 b运行处，它们在 0 距离处合并，因为它们包含重叠向量。

所以，我再次运行聚类过程，如前面的代码所示，但这次只使用包含至少有一个副本的 55 个向量的数据子集。我通过以下方式获得了 X_subset：

uniqs, uniqs_indices, uniqs_count = np.unique(X, axis=0, return_index=True, return_counts=True)
duplicate_rows_indices = list( set(range(len(X))) - set(uniqs_indices) )
number_of_duplicate_rows = len(X)-len(uniqs) # 37

all_duplicate_rows = set()
for i in duplicate_rows_indices:
    _rows = set(np.where(X == X[i])[0])
    for j in _rows:
        all_duplicate_rows.add(j)

rows_with_at_least_a_copy = list(all_duplicate_rows)
number_of_rows_with_at_least_a_copy = len(rows_with_at_least_a_copy)  # 55

X_subset = X[rows_with_at_least_a_copy]

我这次的输出是：

#Z_1 == Z_2 ?  False
#Req.    Got     equal?
#1        1       True
#2        2       True
#3        2       False
#4        3       False
#5        4       False
#6        5       False
#7        6       False
#8        7       False
#9        8       False
#10       9       False
#11       10      False
#12       11      False
#13       12      False
#14       13      False
#15       14      False
#16       15      False
#17       16      False
#18       17      False
#19       18      False
#20       20      True
#21       21      True
#22       22      True
#23       23      True
#24       24      True
#25       25      True
#26       26      True
#27       27      True
#28       28      True
#29       29      True
#30       30      True
#31       31      True
#32       32      True
#33       33      True
#34       34      True
#35       35      True
#36       36      True
#37       37      True
#38       38      True
#39       39      True
#40       40      True
#41       41      True
#42       42      True
#43       43      True
#44       44      True
#45       45      True
#46       46      True
#47       47      True
#48       48      True
#49       49      True
#Any problem cutting Z_1 for n_clusters in [1,50]?  False
#Any problem cutting Z_2 for n_clusters in [1,50]?  True

因此，fastcluster 和 scipy 不是 return 相同的结果，如果只是由于重叠点，这可能是可以接受的，因为聚类情况的模糊性。但问题是 cut_tree() 在给定由 linkage_vector() 获得的链接矩阵时，在这些情况下有时 return 所请求的簇数。如何解决？

使用的库版本：scipy '0.19.1', numpy '1.13.3', fastcluster '1.1.24'

编辑： 也贴在这里：https://github.com/scipy/scipy/issues/7977.

Answer 1

首先，我不关心scipy.cluster.hierarchy.linkage()、fastcluster.linkage()、fastcluster.linkage_vector()三种方法输出的差异。差异可能有两个原因：

由于浮点运算的限制（代数等价公式产生不同的结果），聚类距离存在细微差异。
除了算术差异外，允许算法以不同方式解析关系。在这种情况下，“平局”是两对簇 (A,B)，(C,D)，(C,D) 之间的距离完全相同=42=]A和B，以及C和D。正如OP所写，例如在过程开始时有许多距离为 0 的点对，算法可以按任何顺序聚类。

其次，为什么 scipy.cluster.hierarchy.cut_tree() 没有产生所需数量的簇是这里真正有趣的问题。在代数上，“Ward”聚类方法不能在树状图中产生所谓的“反转”。（当一步的聚类距离大于下一步的聚类距离时，就会发生倒置。）即逐步树状图（linkage()的输出）第三列的距离序列是理想的“Ward”方法的单调递增序列。然而，由于浮点数不准确，在 OP 提供的数据集中，步骤 35 中的聚类距离被 fastcluster.linkage_vector() 报告为 2.2e−16，但在下一步 36.

中报告为 0.0 不幸的是，

SciPy 的 cut_tree() 不能很好地处理这些倒置。即使它们发生在树状图中的“深处”（如本例），这也会混淆整个合并过程其余部分的算法，并产生错误形成的簇。（乍一看，我认为树状图不仅在错误的级别“切割”，而且聚类实际上是错误的。不过我还没有完全分析情况。）

这更令人不安，因为不仅像本例中的数值不准确会导致倒置，而且“质心”和“中值”聚类方法经常会产生倒置，即使使用完美的算法也是如此。

最后，一个不完美的解决方案：为了解决问题，将SciPy的cut_tree()函数中循环中标记的两行替换为：

for i, node in enumerate(nodes):
    idx = node.pre_order()
    this_group = last_group.copy()
    # ------------- Replace this:
    this_group[idx] = last_group[idx].min()
    this_group[this_group > last_group[idx].max()] -= 1
    # -------------
    if i + 1 in cols_idx:
        groups[np.where(i + 1 == cols_idx)[0]] = this_group
    last_group = this_group

通过以下几行：

    unique_idx = np.unique(last_group[idx])
    this_group[idx] = unique_idx[0]
    for ui in unique_idx[:0:-1]:
        this_group[this_group > ui] -= 1

（查看 SciPy source code 了解上下文。）

但是，我宁愿建议从头开始重新实现 cut_tree() 方法，因为当前的实现过于复杂，并且在运行时复杂性方面可以更有效地执行任务。

Scipy的cut_tree()没有return请求的簇数，用scipy和fastcluster得到的连接矩阵不匹配

Scipy's cut_tree() doesn't return requested number of clusters and the linkage matrices obtained with scipy and fastcluster do not match

python

debugging

hierarchical-clustering

data-analysis

scipy