What is the accuracy of a clustering algorithm?

Question

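I have a set of points that I clustered with a clustering algorithm (k-means in this case). I also know the ground-truth labels, and I want to measure how accurate my clustering is. What I need is the actual accuracy. The problem, of course, is that the labels given by the clustering do not match the ordering of the original ones.

Is there a way to measure this accuracy? The intuitive idea is to compute the score of the confusion matrix for every combination of labels and keep only the maximum. Is there a function that does this?

I have also evaluated my results using the Rand score and the adjusted Rand score. How close are these two measures to the actual accuracy?

Thanks!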

Answer 1

You can use sklearn.metrics.accuracy_score, as documented at the link below:

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html

See the link below for an example.
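One caveat, shown as a minimal sketch with made-up labels (the arrays and the `mapping` dictionary are illustrative assumptions, not from the question): accuracy_score compares labels position by position, so it is only meaningful once the cluster IDs have been mapped to the class IDs.

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels: the clustering recovers the true grouping exactly,
# but names the groups differently (0 and 1 are swapped, 2 is unchanged).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 2]

# Raw comparison: only the two points in group 2 happen to agree.
raw_accuracy = accuracy_score(y_true, y_pred)

# After renaming the clusters to match the classes, every point agrees.
mapping = {1: 0, 0: 1, 2: 2}  # hand-written here, for illustration only
y_mapped = [mapping[label] for label in y_pred]
mapped_accuracy = accuracy_score(y_true, y_mapped)

print(raw_accuracy, mapped_accuracy)
```

In practice the mapping is not known in advance and has to be found from the data.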

Answer 2

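First of all, what does "The problem, of course, is that the labels given by the clustering do not match the ordering of the original one." mean?

If you know the ground-truth labels, you can rearrange them to match the order of the X matrix, so that the KMeans labels line up with the true labels after prediction.

In this situation, I suggest the following.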

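If you have ground-truth labels and want to see how accurate your model is, you need a metric such as the Rand index or the mutual information between the predicted and the true labels. You can do this in a cross-validation scheme and see how the model behaves, i.e. whether it predicts the classes/labels correctly under cross-validation. The goodness of the predictions can be assessed with a metric such as the Rand index.

In summary:

Define a KMeans model and use cross-validation; in each iteration, estimate the Rand index (or mutual information) between the assignments and the true labels. Repeat this for all iterations and finally take the mean of the Rand index scores. If this score is high, the model is good.

Full example: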

from sklearn.cluster import KMeans
from sklearn.metrics.cluster import adjusted_rand_score
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
import numpy as np

# some data
data = load_iris()
X = data.data
y = data.target  # ground truth labels

# Leave-one-out would score a single test point per fold, and the adjusted
# Rand index of a single point is always 1.0, so use folds with several
# points from every class instead.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

rand_index_scores = []
for train_index, test_index in cv.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # the model
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
    kmeans.fit(X_train)  # fit using training data
    predicted_labels = kmeans.predict(X_test)  # predict using test data
    # goodness of the predicted labels on this fold
    rand_index_scores.append(adjusted_rand_score(y_test, predicted_labels))

print(np.mean(rand_index_scores))

Answer 3

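Since clustering is an unsupervised learning problem, there are specific metrics for it: https://scikit-learn.org/stable/modules/classes.html#clustering-metrics

You can refer to the discussion in the scikit-learn user guide to learn about the differences between the clustering metrics: https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation

For example, the adjusted Rand index compares pairs of points and checks that if two points have the same label in the ground truth, they also have the same label in the prediction. Unlike with accuracy, you cannot require strict label equality.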
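If an accuracy-like number is still wanted, one possible approach is the question's "keep the maximum over label combinations" idea, done without enumerating permutations: build the confusion matrix and solve the resulting assignment problem with scipy.optimize.linear_sum_assignment. The sketch below assumes integer labels; the helper name `clustering_accuracy` and the sample arrays are hypothetical, not part of scikit-learn.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score, confusion_matrix


def clustering_accuracy(y_true, y_pred):
    """Accuracy under the best one-to-one mapping of cluster IDs to class IDs."""
    cm = confusion_matrix(y_true, y_pred)  # rows: true classes, columns: cluster IDs
    # Choose the class/cluster pairing that maximizes the number of matched points.
    rows, cols = linear_sum_assignment(cm, maximize=True)
    return cm[rows, cols].sum() / cm.sum()


# Hypothetical labels, for illustration only.
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([1, 1, 0, 0, 0, 0, 2, 2, 2])

acc = clustering_accuracy(y_true, y_pred)

# Renaming the clusters (0 -> 2, 1 -> 0, 2 -> 1) changes neither score.
renamed = np.array([2, 0, 1])[y_pred]
ari = adjusted_rand_score(y_true, y_pred)
ari_renamed = adjusted_rand_score(y_true, renamed)

print(acc, ari, ari_renamed)
```

The matched accuracy and the adjusted Rand index measure different things (point-wise agreement after relabeling versus pair-wise agreement corrected for chance), so the two numbers generally differ even though both are unaffected by how the clusters are named.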

cluster-computing

scikit-learn