Validating Fuzzy Clustering

Question

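I want to use fuzzy C-means clustering on a large unsupervised dataset with 41 variables and 415 observations. However, I am stuck trying to validate the clusters. When I plot with a random number of clusters, I can explain a total of 54% of the variance, which is not great, and there are no really clean clusters like there are with the iris dataset, for example.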

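First, I ran fcm on my scaled data with 3 clusters just to see what happens. But since I am looking for a way to search for the optimal number of clusters, I do not want to set an arbitrarily chosen number of clusters.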

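So I turned to Google and searched for "validate fuzzy clustering in R." This link here was good, but I would still have to try a bunch of different numbers of clusters. I looked at the advclust, ppclust, and clValid packages, but could not find walkthroughs for their functions. I read the documentation for each package, but still could not work out what to do next.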

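I looped over a range of possible cluster counts and checked each one with the k.crisp object from fanny. I started at 100 and worked down to 4. According to the description of the object in the documentation,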

k.crisp = integer (≤ k) giving the number of crisp clusters; can be less than k, where it's recommended to decrease memb.exp.

This does not seem like a valid approach, because it compares the number of crisp clusters with our fuzzy clusters.

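Is there a function that can check cluster validity for 2:10 clusters? Also, is it worth checking the validity of 1 cluster? I think that is a silly question, but I have a strange feeling that 1 optimal cluster might be what I end up with. (If I do get 1 cluster, any tips on what to do besides crying a little inside?)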

Code

library(cluster)
library(factoextra)
library(ppclust)
library(advclust)
library(clValid)
data(iris)
df<-sapply(iris[-5],scale)
res.fanny<-fanny(df,3,metric='SqEuclidean')
res.fanny$k.crisp
# When I try to use euclidean, I get the warning all memberships are very close to 1/l. Maybe increase memb.exp, which I don't fully understand
# As I understand it, fanny with SqEuclidean is equivalent to fuzzy C-means (see the website below). I ultimately want C-means, hence the SqEuclidean distance
fviz_cluster(res.fanny,ellipse.type='norm',palette='jco',ggtheme=theme_minimal(),legend='right')
fviz_silhouette(res.fanny,palette='jco',ggtheme=theme_minimal())

# With ppclust
set.seed(123)
res.fcm<-fcm(df,centers=3,nstart=10)

website as mentioned above.

Answer 1

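As far as I know, you need to loop over different numbers of clusters and see how the percentage of variance explained changes with the number of clusters. This approach is called the elbow method.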

wss <- sapply(2:10, 
       function(k){fcm(df,centers=k,nstart=10)$sumsqrs$tot.within.ss})

plot(2:10, wss,
     type="b", pch = 19, frame = FALSE, 
     xlab="Number of clusters K",
     ylab="Total within-clusters sum of squares")

The resulting plot is

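After k = 5, the total within-cluster sum of squares changes only slowly. So, according to the elbow method, k = 5 is a good candidate for the optimal number of clusters.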
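The elbow plot only looks at within-cluster sums of squares. For the "2:10" check asked about in the question, a rough sketch along the same lines is to compute fuzzy validity indices directly from the membership matrix. The example below assumes fanny from the cluster package with the SqEuclidean metric (as in the question) and uses two hand-written helpers, partition_coefficient and partition_entropy, which are illustrative names and not functions from any of the packages mentioned. Treat it as a starting point rather than a definitive validation procedure.

```r
library(cluster)  # provides fanny()

# Scaled iris measurements, as in the question
df <- scale(iris[-5])

# Partition coefficient: mean squared membership.
# Ranges from 1/k (completely fuzzy) to 1 (completely crisp).
partition_coefficient <- function(u) {
  sum(u^2) / nrow(u)
}

# Partition entropy: 0 for a crisp partition, log(k) for a completely fuzzy one.
partition_entropy <- function(u) {
  n <- nrow(u)
  u <- pmax(u, .Machine$double.eps)  # guard against log(0)
  -sum(u * log(u)) / n
}

ks <- 2:10
validity <- t(sapply(ks, function(k) {
  # Warnings (e.g. about very fuzzy memberships) are suppressed here
  # only to keep the loop output readable.
  fit <- suppressWarnings(fanny(df, k, metric = "SqEuclidean"))
  c(k       = k,
    k.crisp = fit$k.crisp,
    pc      = partition_coefficient(fit$membership),
    pe      = partition_entropy(fit$membership))
}))

print(round(validity, 3))
```

A higher partition coefficient and a lower partition entropy indicate a crisper partition. Both indices tend to drift as k grows, so it is probably safer to read them alongside the elbow plot than to pick k from either one alone. The same two helpers should work on any membership matrix whose rows sum to 1.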

validation

r

cluster-analysis