Get inertia for nltk k means clustering using cosine_similarity
I have used nltk for k means clustering because I want to change the distance metric. Does nltk k means have something like sklearn's inertia? I can't seem to find it in their documentation or online...
The code below is how people typically find inertia using sklearn k means.
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

inertia = []
for n_clusters in range(2, 26, 1):
    clusterer = KMeans(n_clusters=n_clusters)
    preds = clusterer.fit_predict(features)
    centers = clusterer.cluster_centers_
    inertia.append(clusterer.inertia_)

plt.plot([i for i in range(2, 26, 1)], inertia, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum of squared distances')
plt.title('Elbow Method For Optimal k')
plt.show()
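For context, the nltk clustering referred to above presumably looks something like this minimal sketch, using nltk.cluster's KMeansClusterer with cosine_distance (the exact setup was not shown in the original question):

from nltk.cluster import KMeansClusterer, cosine_distance
import numpy as np

# features: a list/array of 1D numpy vectors
vectors = [np.array(f) for f in features]
kclusterer = KMeansClusterer(5, distance=cosine_distance, repeats=25)
assigned_clusters = kclusterer.cluster(vectors, assign_clusters=True)
# KMeansClusterer exposes the fitted centroids via means(), but has no inertia_ attribute
centroids = kclusterer.means()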
You can write your own function to get the inertia for k means clustering in nltk.
Based on the question you posted, and using the same dummy data, it looks like this after making 2 clusters.
Per the documentation https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html, inertia is the sum of squared distances of samples to their closest cluster center.
import numpy as np

feature_matrix = df[['feature1', 'feature2', 'feature3']].to_numpy()
centroid = df['centroid'].to_numpy()

def nltk_inertia(feature_matrix, centroid):
    sum_ = []
    for i in range(feature_matrix.shape[0]):
        # inertia as defined in the scikit-learn docs, i.e. sum of squared distances
        sum_.append(np.sum((feature_matrix[i] - centroid[i])**2))
    return sum(sum_)

nltk_inertia(feature_matrix, centroid)
# output: 27.495250000000002

# now run k means clustering on feature1, feature2, and feature3 with the same number of clusters (2)
scikit_kmeans = KMeans(n_clusters=2)
scikit_kmeans.fit(vectors)  # vectors = [np.array(f) for f in df.values], which contain feature1, feature2, feature3
scikit_kmeans.inertia_
# output: 27.495250000000006
The previous answer actually misses a small detail:
feature_matrix = df[['feature1', 'feature2', 'feature3']].to_numpy()
centroid = df['centroid'].to_numpy()
cluster = df['predicted_cluster'].to_numpy()

def nltk_inertia(feature_matrix, centroid):
    sum_ = []
    for i in range(feature_matrix.shape[0]):
        # index the centroids with the point's assigned cluster label
        sum_.append(np.sum((feature_matrix[i] - centroid[cluster[i]])**2))
    return sum(sum_)
When computing the distance between a centroid and a data point, you have to select the centroid of that point's corresponding cluster. Note the cluster variable in the code above.
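Putting the pieces together, here is a rough end-to-end sketch of computing inertia for an nltk clustering with cosine_distance, combining nltk's KMeansClusterer output with the corrected function above (the feature matrix is dummy data invented for illustration, and cluster is passed explicitly instead of being read from the DataFrame):

import numpy as np
from nltk.cluster import KMeansClusterer, cosine_distance

# dummy data, for illustration only
feature_matrix = np.array([[1.0, 2.0, 3.0],
                           [1.5, 1.8, 2.9],
                           [8.0, 8.2, 7.9],
                           [7.5, 8.1, 8.3]])

kclusterer = KMeansClusterer(2, distance=cosine_distance, repeats=10)
cluster = kclusterer.cluster([np.array(row) for row in feature_matrix],
                             assign_clusters=True)   # per-point cluster labels
centroid = np.array(kclusterer.means())              # one centroid per cluster

def nltk_inertia(feature_matrix, centroid, cluster):
    sum_ = []
    for i in range(feature_matrix.shape[0]):
        # squared distance to the centroid of the point's own cluster
        sum_.append(np.sum((feature_matrix[i] - centroid[cluster[i]])**2))
    return sum(sum_)

print(nltk_inertia(feature_matrix, centroid, cluster))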