如何使 fcluster 达到 return 与 cut_tree 相同的输出?
How to make fcluster to return the same output as cut_tree?
有几个相关的问题,我认为最相关的是。
假设我有一个这样的数据集(出于演示目的高度简化):
import numpy as np
import pandas as pd
from scipy.spatial import distance
from scipy.cluster import hierarchy
val = np.array([[0.20288834, 0.80406494, 4.59921579, 14.28184739],
[0.22477082, 1.43444223, 6.87992605, 12.90299896],
[0.22811485, 0.74509454, 3.85198421, 19.22564266],
[0.20374529, 0.73680174, 3.63178517, 17.82544951],
[0.22722696, 0.86113728, 3.00832186, 16.62306058],
[0.25577882, 0.85671779, 3.70655719, 17.49690061],
[0.23018219, 0.68039151, 2.50815837, 15.09039053],
[0.21638751, 1.12455083, 3.56246872, 18.82866991],
[0.26600895, 1.09415595, 2.85300018, 17.93139433],
[0.22369445, 0.73689845, 3.24919113, 18.60914745]])
df = pd.DataFrame(val, columns=["C{}".format(i) for i in range(val.shape[1])])
C0 C1 C2 C3
0 0.202888 0.804065 4.599216 14.281847
1 0.224771 1.434442 6.879926 12.902999
2 0.228115 0.745095 3.851984 19.225643
3 0.203745 0.736802 3.631785 17.825450
4 0.227227 0.861137 3.008322 16.623061
5 0.255779 0.856718 3.706557 17.496901
6 0.230182 0.680392 2.508158 15.090391
7 0.216388 1.124551 3.562469 18.828670
8 0.266009 1.094156 2.853000 17.931394
9 0.223694 0.736898 3.249191 18.609147
我想对这个数据框的列进行聚类,从而也指定我获得的聚类数。通常,这可以通过使用 cut_tree function.
来实现
但是,目前,cut_tree
is broken and therefore I looked for alternatives which led me to the link at the beginning of this post where it is suggested to use fcluster 作为备选方案。
问题是我看不到如何使用 maxclust
参数指定确切的聚类数量,而只指定最大数量。
所以对于上面的简单示例,我可以这样做:
# number of target cluster
n_clusters = range(1, 5)
for n_clust in n_clusters:
Z = hierarchy.linkage(distance.pdist(df.T.values), method='average', metric='euclidean')
print("--------\nValues from flcuster:\n{}".format(hierarchy.fcluster(Z, n_clust, criterion='maxclust')))
print("\nValues from cut_tree:\n{}".format(hierarchy.cut_tree(Z, n_clust).T))
打印
Values from flcuster:
[1 1 1 1]
Values from cut_tree:
[[0 0 0 0]]
--------
Values from flcuster:
[1 1 1 2]
Values from cut_tree:
[[0 0 0 1]]
--------
Values from flcuster:
[1 1 1 2]
Values from cut_tree:
[[0 0 1 2]]
--------
Values from flcuster:
[1 1 1 2]
Values from cut_tree:
[[0 1 2 3]]
正如您所见,fcluster
returns 最多 2 个不同的集群,而 cut_tree
returns 所需的数量。
在 cut_tree
中的错误被修复之前,有没有办法在一段时间内为 fcluster
获得相同的输出?如果没有,在另一个包中是否有其他好的替代方案?
不确定如何从 fcluster
中获得正确数量的簇。
作为替代方案,scikit-learn 具有 AgglomerativeClustering
:
from sklearn.cluster import AgglomerativeClustering
# number of target cluster
n_clusters = range(1, 5)
for n_clust in n_clusters:
Z = hierarchy.linkage(distance.pdist(df.T.values), method='average', metric='euclidean')
print("--------\nValues from flcuster:\n{}".format(hierarchy.fcluster(Z, n_clust, criterion='maxclust')))
print("\nValues from cut_tree:\n{}".format(hierarchy.cut_tree(Z, n_clust).T))
print("\nValues from AgglomerativeClustering:\n{}".format(AgglomerativeClustering(n_clusters=n_clust, affinity='euclidean', linkage='average').fit(df.T.values).labels_))
哪个 returns 所提供数据集的正确簇数(尽管顺序不同):
Values from flcuster:
[1 1 1 1]
Values from cut_tree:
[[0 0 0 0]]
Values from AgglomerativeClustering:
[0 0 0 0]
--------
Values from flcuster:
[1 1 1 2]
Values from cut_tree:
[[0 0 0 1]]
Values from AgglomerativeClustering:
[0 0 0 1]
--------
Values from flcuster:
[1 1 1 2]
Values from cut_tree:
[[0 0 1 2]]
Values from AgglomerativeClustering:
[0 0 2 1]
--------
Values from flcuster:
[1 1 1 2]
Values from cut_tree:
[[0 1 2 3]]
Values from AgglomerativeClustering:
[3 1 2 0]
有几个相关的问题,我认为最相关的是
假设我有一个这样的数据集(出于演示目的高度简化):
import numpy as np
import pandas as pd
from scipy.spatial import distance
from scipy.cluster import hierarchy
val = np.array([[0.20288834, 0.80406494, 4.59921579, 14.28184739],
[0.22477082, 1.43444223, 6.87992605, 12.90299896],
[0.22811485, 0.74509454, 3.85198421, 19.22564266],
[0.20374529, 0.73680174, 3.63178517, 17.82544951],
[0.22722696, 0.86113728, 3.00832186, 16.62306058],
[0.25577882, 0.85671779, 3.70655719, 17.49690061],
[0.23018219, 0.68039151, 2.50815837, 15.09039053],
[0.21638751, 1.12455083, 3.56246872, 18.82866991],
[0.26600895, 1.09415595, 2.85300018, 17.93139433],
[0.22369445, 0.73689845, 3.24919113, 18.60914745]])
df = pd.DataFrame(val, columns=["C{}".format(i) for i in range(val.shape[1])])
C0 C1 C2 C3
0 0.202888 0.804065 4.599216 14.281847
1 0.224771 1.434442 6.879926 12.902999
2 0.228115 0.745095 3.851984 19.225643
3 0.203745 0.736802 3.631785 17.825450
4 0.227227 0.861137 3.008322 16.623061
5 0.255779 0.856718 3.706557 17.496901
6 0.230182 0.680392 2.508158 15.090391
7 0.216388 1.124551 3.562469 18.828670
8 0.266009 1.094156 2.853000 17.931394
9 0.223694 0.736898 3.249191 18.609147
我想对这个数据框的列进行聚类,从而也指定我获得的聚类数。通常,这可以通过使用 cut_tree function.
来实现但是,目前,cut_tree
is broken and therefore I looked for alternatives which led me to the link at the beginning of this post where it is suggested to use fcluster 作为备选方案。
问题是我看不到如何使用 maxclust
参数指定确切的聚类数量,而只指定最大数量。
所以对于上面的简单示例,我可以这样做:
# number of target cluster
n_clusters = range(1, 5)
for n_clust in n_clusters:
Z = hierarchy.linkage(distance.pdist(df.T.values), method='average', metric='euclidean')
print("--------\nValues from flcuster:\n{}".format(hierarchy.fcluster(Z, n_clust, criterion='maxclust')))
print("\nValues from cut_tree:\n{}".format(hierarchy.cut_tree(Z, n_clust).T))
打印
Values from flcuster:
[1 1 1 1]
Values from cut_tree:
[[0 0 0 0]]
--------
Values from flcuster:
[1 1 1 2]
Values from cut_tree:
[[0 0 0 1]]
--------
Values from flcuster:
[1 1 1 2]
Values from cut_tree:
[[0 0 1 2]]
--------
Values from flcuster:
[1 1 1 2]
Values from cut_tree:
[[0 1 2 3]]
正如您所见,fcluster
returns 最多 2 个不同的集群,而 cut_tree
returns 所需的数量。
在 cut_tree
中的错误被修复之前,有没有办法在一段时间内为 fcluster
获得相同的输出?如果没有,在另一个包中是否有其他好的替代方案?
不确定如何从 fcluster
中获得正确数量的簇。
作为替代方案,scikit-learn 具有 AgglomerativeClustering
:
from sklearn.cluster import AgglomerativeClustering
# number of target cluster
n_clusters = range(1, 5)
for n_clust in n_clusters:
Z = hierarchy.linkage(distance.pdist(df.T.values), method='average', metric='euclidean')
print("--------\nValues from flcuster:\n{}".format(hierarchy.fcluster(Z, n_clust, criterion='maxclust')))
print("\nValues from cut_tree:\n{}".format(hierarchy.cut_tree(Z, n_clust).T))
print("\nValues from AgglomerativeClustering:\n{}".format(AgglomerativeClustering(n_clusters=n_clust, affinity='euclidean', linkage='average').fit(df.T.values).labels_))
哪个 returns 所提供数据集的正确簇数(尽管顺序不同):
Values from flcuster:
[1 1 1 1]
Values from cut_tree:
[[0 0 0 0]]
Values from AgglomerativeClustering:
[0 0 0 0]
--------
Values from flcuster:
[1 1 1 2]
Values from cut_tree:
[[0 0 0 1]]
Values from AgglomerativeClustering:
[0 0 0 1]
--------
Values from flcuster:
[1 1 1 2]
Values from cut_tree:
[[0 0 1 2]]
Values from AgglomerativeClustering:
[0 0 2 1]
--------
Values from flcuster:
[1 1 1 2]
Values from cut_tree:
[[0 1 2 3]]
Values from AgglomerativeClustering:
[3 1 2 0]