Another function more useful than Elbow for finding k clusters

Question

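I am trying to find a suitable number of clusters k for the k-means method in machine learning. I am using the Elbow method, but it is time-consuming and has high complexity. Can anyone suggest another method to replace it? Thanks a lot.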

Answer 1

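One metric you can use to evaluate clustering results is the silhouette coefficient. For a point that lies closer to its own cluster than to any other cluster, this value basically computes: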

silhouette coefficient = 1 - (intra-cluster cohesion) / (inter-cluster separation)

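Values range from -1 to +1, and you generally want a value closer to 1.0. So if you run a clustering algorithm (e.g. k-means or hierarchical clustering) to produce 3 clusters, you can call the silhouette library to compute a silhouette coefficient, say 0.50. If you run the algorithm again to produce 4 clusters, you can compute another silhouette coefficient, say 0.55. You can then conclude that 4 clusters is the better clustering because it has the higher silhouette coefficient.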
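To make the definition concrete, here is a minimal sketch that computes the per-point silhouette value by hand, using the standard form s(i) = (b - a) / max(a, b), where a is the mean distance from a point to the other members of its own cluster and b is the smallest mean distance to any other cluster. When a < b this reduces to the 1 - a/b form shown above. The helper name `manualSilhouette` and the toy data are assumptions made for illustration; `silhouette()` from the cluster package is used only as a cross-check.

```r
library(cluster)  # provides silhouette(), used here only as a cross-check

# Hypothetical helper: silhouette value of every point, computed from the definition.
manualSilhouette <- function(clusterIds, distMatrix) {
    d <- as.matrix(distMatrix)
    sapply(seq_len(nrow(d)), function(i) {
        own <- clusterIds == clusterIds[i]
        if (sum(own) == 1) return(0)          # singleton cluster: s(i) is defined as 0
        # d[i, i] is 0, so dividing by (cluster size - 1) averages over the other members
        a <- sum(d[i, own]) / (sum(own) - 1)
        # Mean distance to each other cluster; b is the smallest of these
        others <- setdiff(unique(clusterIds), clusterIds[i])
        b <- min(sapply(others, function(cl) mean(d[i, clusterIds == cl])))
        (b - a) / max(a, b)
    })
}

# Toy data (an assumption for this sketch): two well-separated 2-D blobs
set.seed(42)
pts <- data.frame(x = c(rnorm(20, 0, 0.3), rnorm(20, 3, 0.3)),
                  y = c(rnorm(20, 0, 0.3), rnorm(20, 3, 0.3)))
km <- kmeans(pts, centers = 2, nstart = 5)

manual <- manualSilhouette(km$cluster, dist(pts))
reference <- as.numeric(silhouette(km$cluster, dist(pts))[, "sil_width"])

print(all.equal(manual, reference))  # the two computations should agree
print(mean(manual))                  # average silhouette width for this clustering
```

The mean of the per-point values is the single number you compare across different values of K.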

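Below is a sample data set in which I used R to create three distinct clusters in two-dimensional space. Note: real-world data will never look this clean, with such obvious separation between clusters. Even simple data like Fisher's Iris data set has overlap between the labeled clusters.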

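You can then use R's silhouette library to compute the silhouette coefficient. (More information can be found at the STHDA website.) Below are plots of the silhouette information. The one metric you want is in the lower-left corner, "Average silhouette width: xxx". That value is the average of all the horizontal bars.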
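If you want the number rather than the plot, the average silhouette width can also be read from the summary of the silhouette object. The sketch below assumes the cluster package and uses made-up sample data (`demoData` is a hypothetical name); `avg.width` and `clus.avg.widths` are components of the list returned by `summary()` on a silhouette object.

```r
library(cluster)

# Made-up sample data (an assumption for this sketch): three 2-D blobs
set.seed(1)
demoData <- data.frame(x = c(rnorm(25, 1, 0.1), rnorm(25, 2, 0.1), rnorm(25, 3, 0.2)),
                       y = c(rnorm(25, 1, 0.1), rnorm(25, 4, 0.1), rnorm(25, 1, 0.2)))

sil <- silhouette(kmeans(demoData, centers = 3, nstart = 5)$cluster, dist(demoData))
silSummary <- summary(sil)

avgWidth <- silSummary$avg.width             # same value the plot labels "Average silhouette width"
clusterWidths <- silSummary$clus.avg.widths  # one average per cluster

print(avgWidth)
print(clusterWidths)
```

The per-cluster averages can help spot a single weak cluster that the overall average hides.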

Here is the silhouette coefficient for K=2 clusters.

plot(silhouette(kmeans(df, centers=2)$cluster, dist(df)))

Here is the silhouette coefficient for K=3 clusters.

plot(silhouette(kmeans(df, centers=3)$cluster, dist(df)))

Here is the silhouette coefficient for K=4 clusters.

plot(silhouette(kmeans(df, centers=4)$cluster, dist(df)))

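Looking at the silhouette coefficients, you can conclude that K=3 clusters is the best clustering because it has the highest silhouette coefficient.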

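You can find the best K value programmatically by simply scanning over several candidate values of K (for example, between 2 and 10) while keeping track of the highest silhouette coefficient found. I have done that below, and also built a plot of the silhouette coefficient (y-axis) against K (x-axis). The output shows: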

Best Silhouette coefficient=0.888926 occurred at k=3

library(cluster) # for silhouette
library(ggplot2) # for ggplot
library(scales) # for pretty_breaks


# Create sample 2-D data set with clusters around the points (1,1), (2,4), and (3,1)
x<- c(rnorm(n=25, mean=1,sd=.1), rnorm(n=25,mean=2,sd=.1),rnorm(n=25,mean=3,sd=.2))
y<- c(rnorm(n=25, mean=1,sd=.1), rnorm(n=25,mean=4,sd=.1),rnorm(n=25,mean=1,sd=.2))

df <- data.frame(x=x, y=y)

xMax <- max(x)
yMax <- max(y)
print(ggplot(df, aes(x,y)) + geom_point() + xlim(0, max(xMax, yMax)) + ylim(0, max(xMax,yMax)))


# Use the Iris data set.
#df <- subset(iris, select=-c(Species))
#df <- scale(df)


# Run through multiple candidate values of K clusters.

xValues <- c() # Holds the kvalues (x-axis)
yValues <- c() # Holds the silhouette coefficient values (y-axis)
bestKValue <- 0
bestSilhouetteCoefficient <- -Inf # Silhouette values can be negative, so do not start at 0

kSequence <- seq(2, 5)

for (kValue in kSequence) {

    xValues <- append(xValues, kValue)
    kmeansResult <- kmeans(df, centers=kValue, nstart=5)
    silhouetteResult <- silhouette(kmeansResult$cluster, dist(df))
    silhouetteCoefficient <- mean(silhouetteResult[,3])
    yValues <- append(yValues, silhouetteCoefficient)

    if (silhouetteCoefficient > bestSilhouetteCoefficient) {
        bestSilhouetteCoefficient <- silhouetteCoefficient
        bestKValue <- kValue
    }
}

# Create a dataframe for ggplot to plot the accumulated silhouette values.
dfSilhouette <- data.frame(k=xValues, silhouetteCoefficient=yValues)

# Create the ggplot line plot for silhouette coefficient.
silhouettePlot<- ggplot(data=dfSilhouette, aes(k)) +
    geom_line(aes(y=silhouetteCoefficient)) +
    xlab("k") +
    ylab("Average silhouette width") +
    ggtitle("Average silhouette width") +
    scale_x_continuous(breaks=pretty_breaks(n=20)) 

print(silhouettePlot)

printf <- function(...) cat(sprintf(...))
printf("Best Silhouette coefficient=%f occurred at k=%d\n", bestSilhouetteCoefficient, bestKValue)

Note that I used the printf function from the answer here.
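The commented-out Iris lines in the script hint at trying the same scan on real data. As a rough sketch of that idea, the scan can be wrapped in a small function and applied to the scaled Iris measurements. The names `scanSilhouette` and `scanResult` are hypothetical, and the range 2 to 6 is an arbitrary choice for illustration.

```r
library(cluster)

# Hypothetical helper: average silhouette width for each candidate k.
scanSilhouette <- function(data, kValues, nstart = 5) {
    distances <- dist(data)  # computed once and reused for every k
    widths <- sapply(kValues, function(k) {
        clusters <- kmeans(data, centers = k, nstart = nstart)$cluster
        mean(silhouette(clusters, distances)[, "sil_width"])
    })
    data.frame(k = kValues, avgSilhouetteWidth = widths)
}

set.seed(123)
irisScaled <- scale(iris[, 1:4])  # drop Species and standardize the four measurements
scanResult <- scanSilhouette(irisScaled, 2:6)
print(scanResult)

best <- scanResult$k[which.max(scanResult$avgSilhouetteWidth)]
cat(sprintf("Best k by average silhouette width: %d\n", best))
```

Because the Iris clusters overlap, the silhouette values here should be noticeably lower than on the synthetic data, and the highest-scoring k need not match the three labeled species.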

A question related to yours is here.


machine-learning

data-mining

k-means