How do I automate the number of clusters?
Edit: I accept that my question has been closed for being similar but I think the answers have provided valuable knowledge for others so this should be open.
I have been playing with the following script:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
import textract
import os
folder_to_scan = '/media/sf_Documents/clustering'
dict_of_docs = {}
# Gets all the files to scan with textract
for root, sub, files in os.walk(folder_to_scan):
    for file in files:
        full_path = os.path.join(root, file)
        print(f'Processing {file}')
        try:
            text = textract.process(full_path)
            dict_of_docs[file] = text
        except Exception as e:
            print(e)
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(dict_of_docs.values())
true_k = 3
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)
print("Top terms per cluster:")
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(true_k):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind])
It scans a folder of scanned document images, extracts the text, and then clusters the text. I know for a fact that there are 3 different types of documents, so I set true_k to 3. But what if I had a folder of unknown documents, where there could be anywhere from 1 to 100 different document types?
This is a slippery area, because it is very difficult to measure how "well" a clustering algorithm performs without any ground-truth labels. In order to make an automatic selection, you need a metric to compare how KMeans performs for different values of n_clusters.
A popular choice is the silhouette score. You can find more details about it here. Here is the quote from the scikit-learn documentation:
The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is (b - a) / max(a, b). To clarify, b is the distance between a sample and the nearest cluster that the sample is not a part of. Note that Silhouette Coefficient is only defined if number of labels is 2 <= n_labels <= n_samples - 1.
Therefore, you can only compute the silhouette score for n_clusters >= 2 (which, unfortunately, might be a limitation for you given your problem description).
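As a quick illustration of the formula above (this sketch is not part of the original answer), sklearn.metrics.silhouette_samples returns the per-sample coefficients (b - a) / max(a, b), and their mean is exactly what silhouette_score reports:

from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

# Illustrative sketch: per-sample silhouette coefficients on the Iris data
X = load_iris().data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

per_sample = silhouette_samples(X, labels)  # one value in [-1, 1] per sample
print(per_sample[:5])
print(per_sample.mean())                    # same value as silhouette_score(X, labels)
print(silhouette_score(X, labels))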
This is how you would use it on a dummy dataset (you can then adapt it to your code; it is just to have a reproducible example):
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
iris = load_iris()
X = iris.data
sil_score_max = -1 #this is the minimum possible score
for n_clusters in range(2,10):
    model = KMeans(n_clusters=n_clusters, init='k-means++', max_iter=100, n_init=1)
    labels = model.fit_predict(X)
    sil_score = silhouette_score(X, labels)
    print("The average silhouette score for %i clusters is %0.2f" % (n_clusters, sil_score))
    if sil_score > sil_score_max:
        sil_score_max = sil_score
        best_n_clusters = n_clusters
This will return:
The average silhouette score for 2 clusters is 0.68
The average silhouette score for 3 clusters is 0.55
The average silhouette score for 4 clusters is 0.50
The average silhouette score for 5 clusters is 0.49
The average silhouette score for 6 clusters is 0.36
The average silhouette score for 7 clusters is 0.46
The average silhouette score for 8 clusters is 0.34
The average silhouette score for 9 clusters is 0.31
Hence you will have best_n_clusters = 2 (NB: in reality, Iris has three classes...).
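If you want to plug this selection back into your original script, a minimal sketch (my own adaptation, not part of the answer above) could reuse the TF-IDF matrix X you already built and the same search range:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Sketch: choose n_clusters for the TF-IDF matrix X from the question's script
best_n_clusters = 2
sil_score_max = -1
for n_clusters in range(2, 10):  # same range as above; widen it if you expect more document types
    model = KMeans(n_clusters=n_clusters, init='k-means++', max_iter=100, n_init=1)
    labels = model.fit_predict(X)
    # with TF-IDF features, silhouette_score(X, labels, metric='cosine') may also be worth trying
    sil_score = silhouette_score(X, labels)
    if sil_score > sil_score_max:
        sil_score_max = sil_score
        best_n_clusters = n_clusters

# Refit with the selected k and reuse the existing "top terms per cluster" printout
model = KMeans(n_clusters=best_n_clusters, init='k-means++', max_iter=100, n_init=1)
model.fit(X)

Note that with n_init=1 the scores can vary from run to run; increasing n_init (or setting random_state) makes the selection more stable.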