Average deviation of data points from their cluster center changes with each iteration
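My dataset can be found on Kaggle at https://www.kaggle.com/vjchoudhary7/customer-segmentation-tutorial-in-python. I am running k-means on this dataset, which has 4 columns and 200 rows, with k = 5. I want to find the cluster radius, so I measured the average distance of each data point from its respective cluster center, but the values change every time I rerun my program. My cluster centers do not change between runs, so what is going on? How do I fix this?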
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import euclidean_distances
from sklearn.preprocessing import StandardScaler
import numpy as np
import scipy.spatial.distance as sdist
df = pd.read_csv(r'D:\Mall_Customers.csv', usecols=['Spending Score (1-100)', 'Annual Income (k$)'])
x = StandardScaler().fit_transform(df)
kmeans = KMeans(n_clusters=5, max_iter=100, random_state=0)
y_kmeans= kmeans.fit_predict(x)
centroids = kmeans.cluster_centers_
print(centroids)
df["cluster"] = kmeans.labels_
n_clusters = 5
clusters = [x[y_kmeans == i] for i in range(n_clusters)]
for i, c in enumerate(clusters):
    print('Cluster {} has {} observations: {}...'.format(i, len(c), c[0]))
df["cluster"] = kmeans.labels_
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(df)
#cluster radius
def k_mean_distance(data, cx, cy, i_centroid, cluster_labels):
    distances = [np.sqrt((x - cx) ** 2 + (y - cy) ** 2) for (x, y) in data[cluster_labels == i_centroid]]
    return np.mean(distances)
t_data = PCA(n_components=2).fit_transform(x)
k_means = KMeans()
clusters = k_means.fit_predict(t_data)
centroids = kmeans.cluster_centers_
c_mean_distances = []
for i, (cx, cy) in enumerate(centroids):
    mean_distance = k_mean_distance(t_data, cx, cy, i, clusters)
    c_mean_distances.append(mean_distance)
print("mean distances are", c_mean_distances)
Output of run 1: [1.5381892556224435, 1.796763983963032, 1.5144402423920744, 3.4372440532366753, 1.6533031213582314]
Run 2: [3.180393284279158, 2.809194267986748, 0.7823704675079582, 3.4929008204149365, 1.8109097594336663]
Run 3: [1.9461073260609538, 3.2032294269352155, 2.447917517713439, 3.4372440532366753, 2.197239028470577]
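I'll add an answer to document the problem.
First, when you compute a low-dimensional embedding, check whether it needs a random seed to be reproducible. In this case (PCA) I think it is fine, but other low-dimensional embeddings may behave differently.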
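Second, KMeans does not always converge to the global optimum, so different runs can converge to different clusters. To keep KMeans reproducible, scikit-learn provides the random_state input parameter.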
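As a rough illustration of the effect of random_state, the sketch below fits KMeans twice with the same seed and checks that the cluster assignments match. The data comes from make_blobs and is only a stand-in (an assumption, not the Mall_Customers dataset); the helper name fit_labels is hypothetical, and n_init=10 is passed explicitly only to keep the behavior the same across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (an assumption; not the Mall_Customers dataset).
points, _ = make_blobs(n_samples=200, centers=5, random_state=42)


def fit_labels(seed):
    """Fit KMeans with a fixed seed and return the cluster labels."""
    model = KMeans(n_clusters=5, n_init=10, max_iter=100, random_state=seed)
    return model.fit_predict(points)


labels_a = fit_labels(0)
labels_b = fit_labels(0)

# Same seed and same data: compare the assignments from the two fits.
same_seed_identical = np.array_equal(labels_a, labels_b)
print(same_seed_identical)
```

Leaving random_state unset lets each fit start from different initial centers, which is one way the per-cluster averages can differ between program runs.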
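You set random_state the first time you run KMeans, which makes the first part of your code reproducible. To make the clustering after the PCA embedding reproducible as well, set the random state in the same way: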
k_means = KMeans(n_clusters=5, max_iter=100, random_state=0)
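One more thing worth checking, beyond the seed: in the question's code the labels come from k_means (fitted on the PCA output) while the centroids are read from kmeans (fitted on the scaled data), and KMeans() with no arguments defaults to 8 clusters rather than 5. The sketch below assumes the intent was to take both labels and centers from the same fitted model. The blob data is a synthetic stand-in, and mean_radius_per_cluster and run_once are hypothetical helper names.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def mean_radius_per_cluster(data, model):
    """Mean Euclidean distance from each point to its own cluster center."""
    labels = model.labels_
    centers = model.cluster_centers_
    # Distance of every point to the center of the cluster it was assigned to.
    dists = np.linalg.norm(data - centers[labels], axis=1)
    return np.array([dists[labels == k].mean() for k in range(len(centers))])


# Synthetic stand-in for the scaled two-column data (an assumption).
raw, _ = make_blobs(n_samples=200, centers=5, random_state=42)
x = StandardScaler().fit_transform(raw)
t_data = PCA(n_components=2).fit_transform(x)


def run_once():
    """Fit one seeded model and measure radii against that same model."""
    k_means = KMeans(n_clusters=5, n_init=10, max_iter=100, random_state=0)
    k_means.fit(t_data)
    return mean_radius_per_cluster(t_data, k_means)


radii_first = run_once()
radii_second = run_once()
print("mean distances are", radii_first)
```

Because the centers and labels now belong to the same seeded model, the two calls to run_once should agree, and each entry describes the cluster it is indexed by.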