Within-cluster similarity in KMeans
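I am trying to cluster two-dimensional user data with KMeans in scikit-learn (Python). I used the elbow method (the point beyond which adding more clusters no longer produces a significant drop in the sum of squared errors) to choose the number of clusters, which came out to 50.
After applying KMeans, I want to understand how similar the data points within each cluster are. Since I have 50 clusters, is there a way to get a single number per cluster (something like the variance within each cluster) that tells me how close together the points in that cluster are? For example, a value like 0.8 would mean the records in the cluster have high variance, while 0.2 would mean they are closely "related".
In short, is there a way to get a number that indicates how "good" each cluster in KMeans is? "Good" is relative, but let's say I am mainly interested in within-cluster variance as the measure of how good a particular cluster is.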
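To make the "variance within each cluster" idea concrete, here is a minimal sketch of one possible per-cluster number: the mean squared Euclidean distance from each point to its own centroid. This is an assumption about what such a number could look like, not something taken from the linked example. The helper name `per_cluster_variance` is hypothetical, and the data comes from `make_blobs` with 4 clusters rather than the real 50-cluster user data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def per_cluster_variance(X, labels, centers):
    """Hypothetical helper: mean squared distance to the own centroid, per cluster."""
    # Squared Euclidean distance from every sample to the centroid it is assigned to.
    sq_dist = ((X - centers[labels]) ** 2).sum(axis=1)
    # Average those distances separately for each cluster index.
    return np.array([sq_dist[labels == k].mean() for k in range(len(centers))])


# Synthetic stand-in for the user data (assumption: 2 features, 4 blobs).
X, _ = make_blobs(n_samples=500, n_features=2, centers=4,
                  cluster_std=1, center_box=(-10.0, 10.0), random_state=1)

km = KMeans(n_clusters=4, n_init=10, random_state=10).fit(X)
variances = per_cluster_variance(X, km.labels_, km.cluster_centers_)

for k, v in enumerate(variances):
    print("cluster", k, "mean squared distance to centroid:", round(float(v), 3))
```

These values are in squared units of the input features, so they are not bounded between 0 and 1 the way the 0.8 / 0.2 example suggests. Weighted by cluster size, they add up to the `inertia_` that the elbow method is based on.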
Code example using the silhouette score, taken from https://plot.ly/scikit-learn/plot-kmeans-silhouette-analysis/:
from __future__ import print_function

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

# Generating the sample data from make_blobs
# This particular setting has one distinct cluster and 3 clusters placed close
# together.
X, y = make_blobs(n_samples=500,
                  n_features=2,
                  centers=4,
                  cluster_std=1,
                  center_box=(-10.0, 10.0),
                  shuffle=True,
                  random_state=1)  # For reproducibility

range_n_clusters = [2, 3, 4, 5, 6]

for n_clusters in range_n_clusters:
    # Initialize the clusterer with n_clusters value and a random generator
    # seed of 10 for reproducibility.
    clusterer = KMeans(n_clusters=n_clusters, random_state=10)
    cluster_labels = clusterer.fit_predict(X)
    print(cluster_labels)

    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(X, cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg)

    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(X, cluster_labels)
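The per-sample values from `silhouette_samples` can also be averaged within each cluster to give one number per cluster. The following is a rough sketch under the same assumptions as the example above (synthetic `make_blobs` data, 4 clusters); the dictionary name `cluster_silhouette` is made up for illustration and is not part of the linked example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

# Synthetic stand-in data (assumption), same settings as the example above.
X, _ = make_blobs(n_samples=500, n_features=2, centers=4,
                  cluster_std=1, center_box=(-10.0, 10.0), random_state=1)

labels = KMeans(n_clusters=4, n_init=10, random_state=10).fit_predict(X)

# One silhouette coefficient per sample, each in the range [-1, 1].
sample_values = silhouette_samples(X, labels)

# Average the per-sample coefficients inside each cluster.
cluster_silhouette = {
    int(k): float(sample_values[labels == k].mean())
    for k in np.unique(labels)
}

for k, s in cluster_silhouette.items():
    print("cluster", k, "mean silhouette:", round(s, 3))
```

Unlike a raw variance, this number reflects both how tight a cluster is and how far it sits from its nearest neighbouring cluster; values near 1 suggest a compact, well-separated cluster, while values near 0 or below suggest overlap with another cluster.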