How to measure performance of K-Means cluster in R? [image & code included]

Question

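I am currently running a K-means cluster analysis on some of my company's customer data. I want to measure how well this clustering performs, but I don't know which packages to use for that, and I am also unsure whether my clusters are grouped too closely together.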

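The data fed into the clustering is a simple RFM (recency, frequency, and monetary value) table. I have also included each customer's average order value per transaction. I used the elbow method to determine the optimal number of clusters. The data consists of 1,400 customers and 4 metric values.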

An image of the cluster plot and the R code are attached as well.

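library(dplyr)       # glimpse()
library(factoextra)  # fviz_cluster()

drop = c('CUST_Business_NM')

#Cleaning & Scaling the Data
new_cluster_data = na.omit(data)
# Drop the name column from the NA-free data (indexing `data` again would discard na.omit)
new_cluster_data = new_cluster_data[, !(names(new_cluster_data) %in% drop)]
new_cluster_data = scale(new_cluster_data)
glimpse(new_cluster_data)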

#Elbow Method for Optimal Clusters
k.max <- 15
data <- new_cluster_data
wss <- sapply(1:k.max, 
              function(k){kmeans(data, k, nstart=50,iter.max = 15 )$tot.withinss})
#Plot out the Elbow
wss
plot(1:k.max, wss,
     type="b", pch = 19, frame = FALSE, 
     xlab="Number of clusters K",
     ylab="Total within-clusters sum of squares")

#Create the Cluster
kmeans_test = kmeans(new_cluster_data, centers = 8, nstart = 1000)
View(kmeans_test$cluster)

#Visualize the Cluster
fviz_cluster(kmeans_test, data = new_cluster_data,  show.clust.cent = TRUE, geom = c("point", "text"))

Answer 1

You probably don't want to measure the performance of the clusters themselves, but rather the performance of the clustering algorithm, in this case kmeans.

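First, be clear about which distance measure you want the clustering to use. Clustering works from a dissimilarity matrix, so the choice of distance measure is critical. Note that kmeans itself always minimizes squared Euclidean distance to the cluster centers, so other measures matter mainly for inspecting the data or for distance-based clustering methods. You can experiment with euclidean, manhattan, any kind of correlation, or other distance measures, for example: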

library("factoextra")
dis_pearson <- get_dist(yourdataset, method = "pearson")
dis_pearson
fviz_dist(dis_pearson)

This gives you the distance matrix and visualizes it.

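The output of kmeans contains several pieces of information. The most important ones for your question are:

totss: the total sum of squares
withinss: vector of within-cluster sums of squares, one entry per cluster
tot.withinss: total within-cluster sum of squares, i.e. sum(withinss)
betweenss: the between-cluster sum of squares, i.e. totss - tot.withinss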
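
As a minimal sketch of how these components relate, the snippet below fits kmeans on toy data and reads the four values off the result. The simulated matrix `x` and the name `explained` are assumptions made for illustration; they stand in for the scaled customer data and are not part of the original code.

```r
# Toy data standing in for the scaled RFM matrix (hypothetical, not the asker's data)
set.seed(42)
x <- scale(rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 5), ncol = 2)
))

fit <- kmeans(x, centers = 2, nstart = 25)

fit$totss          # total sum of squares
fit$withinss       # one within-cluster sum of squares per cluster
fit$tot.withinss   # equals sum(fit$withinss)
fit$betweenss      # equals fit$totss - fit$tot.withinss

# Share of the total sum of squares accounted for by the cluster structure
explained <- fit$betweenss / fit$totss
explained
```

A ratio closer to 1 means tighter, better separated clusters, but it tends to rise as the number of clusters grows, so it is more useful for comparing runs with the same k than for choosing k.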

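The goal, then, is to optimize these values (low tot.withinss, high betweenss) by clustering the data with different distances and other methods. kmeans is part of the stats package that ships with base R, so you can get these measures simply by running mycluster <- kmeans(yourdataframe, centers = 2) and then printing mycluster.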

Side note: kmeans requires the number of clusters to be defined by the user (extra effort), and it is very sensitive to outliers.
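
Because the number of clusters has to be supplied by the user, one commonly used way to compare candidate values is the average silhouette width, which is not mentioned above. The following is a rough sketch, assuming the cluster package is installed. The toy matrix `x`, the range 2:6, and the name `avg_sil` are illustrative choices, not part of the original code.

```r
library(cluster)  # provides silhouette()

# Toy data standing in for the scaled RFM matrix (hypothetical)
set.seed(42)
x <- scale(rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 5), ncol = 2)
))

d <- dist(x)  # Euclidean distances, consistent with what kmeans uses

# Average silhouette width for each candidate number of clusters
avg_sil <- sapply(2:6, function(k) {
  fit <- kmeans(x, centers = k, nstart = 25)
  sil <- silhouette(fit$cluster, d)
  mean(sil[, "sil_width"])
})
names(avg_sil) <- 2:6
avg_sil
```

Silhouette widths lie between -1 and 1. Higher averages suggest points sit closer to their own cluster than to the nearest neighboring one, so values near 0 would support the concern that clusters are grouped too closely.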
