我对 Davies-Bouldin 指数的 python 实施是否正确?
Is my python implementation of the Davies-Bouldin Index correct?
我正在尝试计算 Python 中的 Davies-Bouldin Index。
下面是代码尝试重现的步骤。
5 步:
- 对于每个簇,计算每个点到质心之间的欧氏距离
- 对于每个集群,计算这些距离的平均值
- 对于每对簇,计算它们质心之间的欧氏距离
然后,
- 对于每对簇,计算到它们各自质心的平均距离(在第 2 步计算)的总和,然后除以它们之间的距离(在第 3 步计算)。
最后,
- 计算所有这些分区(=所有索引)的平均值以获得整个聚类的 Davies-Bouldin 索引
代码
def daviesbouldin(X, labels, centroids):
import numpy as np
from scipy.spatial.distance import pdist, euclidean
nbre_of_clusters = len(centroids) #Get the number of clusters
distances = [[] for e in range(nbre_of_clusters)] #Store intra-cluster distances by cluster
distances_means = [] #Store the mean of these distances
DB_indexes = [] #Store Davies_Boulin index of each pair of cluster
second_cluster_idx = [] #Store index of the second cluster of each pair
first_cluster_idx = 0 #Set index of first cluster of each pair to 0
# Step 1: Compute euclidean distances between each point of a cluster to their centroid
for cluster in range(nbre_of_clusters):
for point in range(X[labels == cluster].shape[0]):
distances[cluster].append(euclidean(X[labels == cluster][point], centroids[cluster]))
# Step 2: Compute the mean of these distances
for e in distances:
distances_means.append(np.mean(e))
# Step 3: Compute euclidean distances between each pair of centroid
ctrds_distance = pdist(centroids)
# Tricky step 4: Compute Davies-Bouldin index of each pair of cluster
for i, e in enumerate(e for start in range(1, nbre_of_clusters) for e in range(start, nbre_of_clusters)):
second_cluster_idx.append(e)
if second_cluster_idx[i-1] == nbre_of_clusters - 1:
first_cluster_idx += 1
DB_indexes.append((distances_means[first_cluster_idx] + distances_means[e]) / ctrds_distance[i])
# Step 5: Compute the mean of all DB_indexes
print("DAVIES-BOULDIN Index: %.5f" % np.mean(DB_indexes))
参数中:
X
是数据
labels
,是由聚类算法计算的标签(即:kmeans)
centroids
是每个簇的质心坐标(即:cluster_centers_
)
此外,请注意我使用的是 Python 3
问题 1:计算每对质心之间的欧氏距离是否正确(第 3 步)?
问题 2:我对第 4 步的实施是否正确?
问题 3:我需要归一化簇内和簇间距离吗?
关于步骤 4 的进一步说明
假设我们有 10 个集群。
循环应该计算每对簇的数据库索引。
第一次迭代:
- 簇 1 的内距离平均值(
distances_means
的索引 0)和簇 2 的内距离平均值(distances_means
的索引 1)相加
- 将此总和除以 2 个聚类之间的距离(
ctrds_distance
的索引 0)
在第二次迭代时:
- 簇 1 的距离内平均值(
distances_means
的索引 0)和簇 3 的距离内平均值(distances_means
的索引 2)相加
- 将此总和除以 2 个聚类之间的距离(
ctrds_distance
的索引 1)
等等...
以10个集群为例,完整的迭代过程应该是这样的:
intra-cluster distance intra-cluster distance distance between their
of cluster: of cluster: centroids(storage num):
0 + 1 / 0
0 + 2 / 1
0 + 3 / 2
0 + 4 / 3
0 + 5 / 4
0 + 6 / 5
0 + 7 / 6
0 + 8 / 7
0 + 9 / 8
1 + 2 / 9
1 + 3 / 10
1 + 4 / 11
1 + 5 / 12
1 + 6 / 13
1 + 7 / 14
1 + 8 / 15
1 + 9 / 16
2 + 3 / 17
2 + 4 / 18
2 + 5 / 19
2 + 6 / 20
2 + 7 / 21
2 + 8 / 22
2 + 9 / 23
3 + 4 / 24
3 + 5 / 25
3 + 6 / 26
3 + 7 / 27
3 + 8 / 28
3 + 9 / 29
4 + 5 / 30
4 + 6 / 31
4 + 7 / 32
4 + 8 / 33
4 + 9 / 34
5 + 6 / 35
5 + 7 / 36
5 + 8 / 37
5 + 9 / 38
6 + 7 / 39
6 + 8 / 40
6 + 9 / 41
7 + 8 / 42
7 + 9 / 43
8 + 9 / 44
这里的问题是我不太确定 distances_means
的索引是否与 ctrds_distance
的索引匹配。
换句话说,我不确定计算的第一个簇间距离是否对应于簇 1 和簇 2 之间的距离。计算的第二个簇间距离是否对应于簇 3 和簇之间的距离集群 1... 依此类推,遵循上述模式。
简而言之:恐怕我是将成对的簇内距离除以不对应的簇间距离。
非常欢迎任何帮助!
这是上面的 Davies-Bouldin 指数原始实现的更短、更正后更快的版本。
def DaviesBouldin(X, labels):
n_cluster = len(np.bincount(labels))
cluster_k = [X[labels == k] for k in range(n_cluster)]
centroids = [np.mean(k, axis = 0) for k in cluster_k]
variances = [np.mean([euclidean(p, centroids[i]) for p in k]) for i, k in enumerate(cluster_k)]
db = []
for i in range(n_cluster):
for j in range(n_cluster):
if j != i:
db.append((variances[i] + variances[j]) / euclidean(centroids[i], centroids[j]))
return(np.max(db) / n_cluster)
回答我自己的问题:
- 初稿(第 4 步)的反驳是正确的,但不相关
- 不需要归一化簇内和簇间距离
- 计算欧氏距离时出错
请注意,您可以找到尝试改进该指数的创新方法,特别是“New Version of Davies-Bouldin Index”,它用柱面距离代替欧氏距离。
感谢您的实施。我只有一个问题:最后一行是否缺少一个分区。在最后一步中,max(db) 的值应除以实施的集群数。
def DaviesBouldin(Daten, DatenLabels):
n_cluster = len(np.bincount(DatenLabels))
cluster_k = [Daten[DatenLabels == k] for k in range(n_cluster)]
centroids = [np.mean(k, axis = 0) for k in cluster_k]
variances = [np.mean([euclidean(p, centroids[i]) for p in k]) for i, k in enumerate(cluster_k)] # mittlere Entfernung zum jeweiligen Clusterzentrum
db = []
for i in range(n_cluster):
for j in range(n_cluster):
if j != i:
db.append((variances[i] + variances[j]) / euclidean(centroids[i], centroids[j]) / n_cluster)
return(np.max(db))
也许我监督那个部门是因为我是 Python 的新手。但在我的图形中(我在一系列集群上迭代)DB.max 的值在开始时非常低,然后增加。在按集群数量缩放后,图形看起来更好(开始时高 DB.max 值,并随着集群数量的增加而不断下降)。
此致
感谢代码和修改 - 真的帮助我开始了。更短、更快的版本并不完全正确。我对其进行了修改,以正确地平均每个集群最相似集群的分散分数。
原始算法和解释见https://www.researchgate.net/publication/224377470_A_Cluster_Separation_Measure:
The DBI is the average of the similarity measures of each cluster with
its most similar cluster.
def DaviesBouldin(X, labels):
n_cluster = len(np.bincount(labels))
cluster_k = [X[labels == k] for k in range(n_cluster)]
centroids = [np.mean(k, axis = 0) for k in cluster_k]
# calculate cluster dispersion
S = [np.mean([euclidean(p, centroids[i]) for p in k]) for i, k in enumerate(cluster_k)]
Ri = []
for i in range(n_cluster):
Rij = []
# establish similarity between each cluster and all other clusters
for j in range(n_cluster):
if j != i:
r = (S[i] + S[j]) / euclidean(centroids[i], centroids[j])
Rij.append(r)
# select Ri value of most similar cluster
Ri.append(max(Rij))
# get mean of all Ri values
dbi = np.mean(Ri)
return dbi
这个比下面的代码快 20 倍,所有计算都在 numpy 中完成。
import numpy as np
from scipy.spatial.distance import euclidean, cdist, pdist, squareform
def db_index(X, y):
"""
Davies-Bouldin index is an internal evaluation method for
clustering algorithms. Lower values indicate tighter clusters that
are better separated.
"""
# get unique labels
if y.ndim == 2:
y = np.argmax(axis=1)
uniqlbls = np.unique(y)
n = len(uniqlbls)
# pre-calculate centroid and sigma
centroid_arr = np.empty((n, X.shape[1]))
sigma_arr = np.empty((n,1))
dbi_arr = np.empty((n,n))
mask_arr = np.invert(np.eye(n, dtype='bool'))
for i,k in enumerate(uniqlbls):
Xk = X[np.where(y==k)[0],...]
Ak = np.mean(Xk, axis=0)
centroid_arr[i,...] = Ak
sigma_arr[i,...] = np.mean(cdist(Xk, Ak.reshape(1,-1)))
# compute pairwise centroid distances, make diagonal elements non-zero
centroid_pdist_arr = squareform(pdist(centroid_arr)) + np.eye(n)
# compute pairwise sigma sums
sigma_psum_arr = squareform(pdist(sigma_arr, lambda u,v: u+v))
# divide
dbi_arr = np.divide(sigma_psum_arr, centroid_pdist_arr)
# get mean of max of off-diagonal elements
dbi_arr = np.where(mask_arr, dbi_arr, 0)
dbi = np.mean(np.max(dbi_arr, axis=1))
return dbi
这是一个使用 numpy 1.14、scipy 1.1.0 和 python 3 的实现。计算速度提高不多,但内存占用应该略小。
import numpy as np
from scipy.spatial.distance import euclidean, cdist, pdist, squareform
def db_index(X, y):
"""
Davies-Bouldin index is an internal evaluation method for
clustering algorithms. Lower values indicate tighter clusters that
are better separated.
Arguments
----------
X : 2D array (n_samples, embed_dim)
Vector for each example.
y : 1D array (n_samples,) or 2D binary array (n_samples, n_classes)
True labels for each example.
Returns
----------
dbi : float
Calculated Davies-Bouldin index.
"""
# get unique labels
if y.ndim == 2:
y = np.argmax(axis=1)
uniqlbls = np.unique(y)
n = len(uniqlbls)
# pre-calculate centroid and sigma
centroid_arr = np.empty((n, X.shape[1]))
sigma_arr = np.empty(n)
for i,k in enumerate(uniqlbls):
Xk = X[np.where(y==k)[0],...]
Ak = np.mean(Xk, axis=0)
centroid_arr[i,...] = Ak
sigma_arr[i,...] = np.mean(cdist(Xk, Ak.reshape(1,-1)))
# loop over non-duplicate cluster pairs
dbi = 0
for i in range(n):
max_Rij = 0
for j in range(n):
if j != i:
Rij = np.divide(sigma_arr[i] + sigma_arr[j],
euclidean(centroid_arr[i,...], centroid_arr[j,...]))
if Rij > max_Rij:
max_Rij = Rij
dbi += max_Rij
return dbi/n
我正在尝试计算 Python 中的 Davies-Bouldin Index。
下面是代码尝试重现的步骤。
5 步:
- 对于每个簇,计算每个点到质心之间的欧氏距离
- 对于每个集群,计算这些距离的平均值
- 对于每对簇,计算它们质心之间的欧氏距离
然后,
- 对于每对簇,计算到它们各自质心的平均距离(在第 2 步计算)的总和,然后除以它们之间的距离(在第 3 步计算)。
最后,
- 计算所有这些分区(=所有索引)的平均值以获得整个聚类的 Davies-Bouldin 索引
代码
def daviesbouldin(X, labels, centroids):
import numpy as np
from scipy.spatial.distance import pdist, euclidean
nbre_of_clusters = len(centroids) #Get the number of clusters
distances = [[] for e in range(nbre_of_clusters)] #Store intra-cluster distances by cluster
distances_means = [] #Store the mean of these distances
DB_indexes = [] #Store Davies_Boulin index of each pair of cluster
second_cluster_idx = [] #Store index of the second cluster of each pair
first_cluster_idx = 0 #Set index of first cluster of each pair to 0
# Step 1: Compute euclidean distances between each point of a cluster to their centroid
for cluster in range(nbre_of_clusters):
for point in range(X[labels == cluster].shape[0]):
distances[cluster].append(euclidean(X[labels == cluster][point], centroids[cluster]))
# Step 2: Compute the mean of these distances
for e in distances:
distances_means.append(np.mean(e))
# Step 3: Compute euclidean distances between each pair of centroid
ctrds_distance = pdist(centroids)
# Tricky step 4: Compute Davies-Bouldin index of each pair of cluster
for i, e in enumerate(e for start in range(1, nbre_of_clusters) for e in range(start, nbre_of_clusters)):
second_cluster_idx.append(e)
if second_cluster_idx[i-1] == nbre_of_clusters - 1:
first_cluster_idx += 1
DB_indexes.append((distances_means[first_cluster_idx] + distances_means[e]) / ctrds_distance[i])
# Step 5: Compute the mean of all DB_indexes
print("DAVIES-BOULDIN Index: %.5f" % np.mean(DB_indexes))
参数中:
X
是数据labels
,是由聚类算法计算的标签(即:kmeans)centroids
是每个簇的质心坐标(即:cluster_centers_
)
此外,请注意我使用的是 Python 3
问题 1:计算每对质心之间的欧氏距离是否正确(第 3 步)?
问题 2:我对第 4 步的实施是否正确?
问题 3:我需要归一化簇内和簇间距离吗?
关于步骤 4 的进一步说明
假设我们有 10 个集群。 循环应该计算每对簇的数据库索引。
第一次迭代:
- 簇 1 的内距离平均值(
distances_means
的索引 0)和簇 2 的内距离平均值(distances_means
的索引 1)相加 - 将此总和除以 2 个聚类之间的距离(
ctrds_distance
的索引 0)
在第二次迭代时:
- 簇 1 的距离内平均值(
distances_means
的索引 0)和簇 3 的距离内平均值(distances_means
的索引 2)相加 - 将此总和除以 2 个聚类之间的距离(
ctrds_distance
的索引 1)
等等...
以10个集群为例,完整的迭代过程应该是这样的:
intra-cluster distance intra-cluster distance distance between their
of cluster: of cluster: centroids(storage num):
0 + 1 / 0
0 + 2 / 1
0 + 3 / 2
0 + 4 / 3
0 + 5 / 4
0 + 6 / 5
0 + 7 / 6
0 + 8 / 7
0 + 9 / 8
1 + 2 / 9
1 + 3 / 10
1 + 4 / 11
1 + 5 / 12
1 + 6 / 13
1 + 7 / 14
1 + 8 / 15
1 + 9 / 16
2 + 3 / 17
2 + 4 / 18
2 + 5 / 19
2 + 6 / 20
2 + 7 / 21
2 + 8 / 22
2 + 9 / 23
3 + 4 / 24
3 + 5 / 25
3 + 6 / 26
3 + 7 / 27
3 + 8 / 28
3 + 9 / 29
4 + 5 / 30
4 + 6 / 31
4 + 7 / 32
4 + 8 / 33
4 + 9 / 34
5 + 6 / 35
5 + 7 / 36
5 + 8 / 37
5 + 9 / 38
6 + 7 / 39
6 + 8 / 40
6 + 9 / 41
7 + 8 / 42
7 + 9 / 43
8 + 9 / 44
这里的问题是我不太确定 distances_means
的索引是否与 ctrds_distance
的索引匹配。
换句话说,我不确定计算的第一个簇间距离是否对应于簇 1 和簇 2 之间的距离。计算的第二个簇间距离是否对应于簇 3 和簇之间的距离集群 1... 依此类推,遵循上述模式。
简而言之:恐怕我是将成对的簇内距离除以不对应的簇间距离。
非常欢迎任何帮助!
这是上面的 Davies-Bouldin 指数原始实现的更短、更正后更快的版本。
def DaviesBouldin(X, labels):
n_cluster = len(np.bincount(labels))
cluster_k = [X[labels == k] for k in range(n_cluster)]
centroids = [np.mean(k, axis = 0) for k in cluster_k]
variances = [np.mean([euclidean(p, centroids[i]) for p in k]) for i, k in enumerate(cluster_k)]
db = []
for i in range(n_cluster):
for j in range(n_cluster):
if j != i:
db.append((variances[i] + variances[j]) / euclidean(centroids[i], centroids[j]))
return(np.max(db) / n_cluster)
回答我自己的问题:
- 初稿(第 4 步)的反驳是正确的,但不相关
- 不需要归一化簇内和簇间距离
- 计算欧氏距离时出错
请注意,您可以找到尝试改进该指数的创新方法,特别是“New Version of Davies-Bouldin Index”,它用柱面距离代替欧氏距离。
感谢您的实施。我只有一个问题:最后一行是否缺少一个分区。在最后一步中,max(db) 的值应除以实施的集群数。
def DaviesBouldin(Daten, DatenLabels):
n_cluster = len(np.bincount(DatenLabels))
cluster_k = [Daten[DatenLabels == k] for k in range(n_cluster)]
centroids = [np.mean(k, axis = 0) for k in cluster_k]
variances = [np.mean([euclidean(p, centroids[i]) for p in k]) for i, k in enumerate(cluster_k)] # mittlere Entfernung zum jeweiligen Clusterzentrum
db = []
for i in range(n_cluster):
for j in range(n_cluster):
if j != i:
db.append((variances[i] + variances[j]) / euclidean(centroids[i], centroids[j]) / n_cluster)
return(np.max(db))
也许我监督那个部门是因为我是 Python 的新手。但在我的图形中(我在一系列集群上迭代)DB.max 的值在开始时非常低,然后增加。在按集群数量缩放后,图形看起来更好(开始时高 DB.max 值,并随着集群数量的增加而不断下降)。
此致
感谢代码和修改 - 真的帮助我开始了。更短、更快的版本并不完全正确。我对其进行了修改,以正确地平均每个集群最相似集群的分散分数。
原始算法和解释见https://www.researchgate.net/publication/224377470_A_Cluster_Separation_Measure:
The DBI is the average of the similarity measures of each cluster with its most similar cluster.
def DaviesBouldin(X, labels):
n_cluster = len(np.bincount(labels))
cluster_k = [X[labels == k] for k in range(n_cluster)]
centroids = [np.mean(k, axis = 0) for k in cluster_k]
# calculate cluster dispersion
S = [np.mean([euclidean(p, centroids[i]) for p in k]) for i, k in enumerate(cluster_k)]
Ri = []
for i in range(n_cluster):
Rij = []
# establish similarity between each cluster and all other clusters
for j in range(n_cluster):
if j != i:
r = (S[i] + S[j]) / euclidean(centroids[i], centroids[j])
Rij.append(r)
# select Ri value of most similar cluster
Ri.append(max(Rij))
# get mean of all Ri values
dbi = np.mean(Ri)
return dbi
这个比下面的代码快 20 倍,所有计算都在 numpy 中完成。
import numpy as np
from scipy.spatial.distance import euclidean, cdist, pdist, squareform
def db_index(X, y):
"""
Davies-Bouldin index is an internal evaluation method for
clustering algorithms. Lower values indicate tighter clusters that
are better separated.
"""
# get unique labels
if y.ndim == 2:
y = np.argmax(axis=1)
uniqlbls = np.unique(y)
n = len(uniqlbls)
# pre-calculate centroid and sigma
centroid_arr = np.empty((n, X.shape[1]))
sigma_arr = np.empty((n,1))
dbi_arr = np.empty((n,n))
mask_arr = np.invert(np.eye(n, dtype='bool'))
for i,k in enumerate(uniqlbls):
Xk = X[np.where(y==k)[0],...]
Ak = np.mean(Xk, axis=0)
centroid_arr[i,...] = Ak
sigma_arr[i,...] = np.mean(cdist(Xk, Ak.reshape(1,-1)))
# compute pairwise centroid distances, make diagonal elements non-zero
centroid_pdist_arr = squareform(pdist(centroid_arr)) + np.eye(n)
# compute pairwise sigma sums
sigma_psum_arr = squareform(pdist(sigma_arr, lambda u,v: u+v))
# divide
dbi_arr = np.divide(sigma_psum_arr, centroid_pdist_arr)
# get mean of max of off-diagonal elements
dbi_arr = np.where(mask_arr, dbi_arr, 0)
dbi = np.mean(np.max(dbi_arr, axis=1))
return dbi
这是一个使用 numpy 1.14、scipy 1.1.0 和 python 3 的实现。计算速度提高不多,但内存占用应该略小。
import numpy as np
from scipy.spatial.distance import euclidean, cdist, pdist, squareform
def db_index(X, y):
"""
Davies-Bouldin index is an internal evaluation method for
clustering algorithms. Lower values indicate tighter clusters that
are better separated.
Arguments
----------
X : 2D array (n_samples, embed_dim)
Vector for each example.
y : 1D array (n_samples,) or 2D binary array (n_samples, n_classes)
True labels for each example.
Returns
----------
dbi : float
Calculated Davies-Bouldin index.
"""
# get unique labels
if y.ndim == 2:
y = np.argmax(axis=1)
uniqlbls = np.unique(y)
n = len(uniqlbls)
# pre-calculate centroid and sigma
centroid_arr = np.empty((n, X.shape[1]))
sigma_arr = np.empty(n)
for i,k in enumerate(uniqlbls):
Xk = X[np.where(y==k)[0],...]
Ak = np.mean(Xk, axis=0)
centroid_arr[i,...] = Ak
sigma_arr[i,...] = np.mean(cdist(Xk, Ak.reshape(1,-1)))
# loop over non-duplicate cluster pairs
dbi = 0
for i in range(n):
max_Rij = 0
for j in range(n):
if j != i:
Rij = np.divide(sigma_arr[i] + sigma_arr[j],
euclidean(centroid_arr[i,...], centroid_arr[j,...]))
if Rij > max_Rij:
max_Rij = Rij
dbi += max_Rij
return dbi/n