Strange graph plotted using k-means clustering

Question

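I have a data frame containing details of several hundred films. I use film attributes (such as rental rate and length) to run k-means clustering. When I plot the k-means clusters, the plot is just three vertical bars. Is that because the attributes are correlated? Could someone explain in more detail? Thanks!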

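import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# Features such as release_year, rental_rate, etc.
# Run k-means clustering on the selected features.
# homework_film is the data frame of films described above.
factors_attributes = homework_film[['rental_rate', 'length', 'language_id']]

# Label encoding: transform strings into numbers
# factors_attributes['rating'] = le.fit_transform(factors_attributes['rating'])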


# The code below finds the optimal K for clustering.
# The graph shows that the optimal K is 3 for this model.
Sum_of_squared_distances = []
Sum = []
K = range(1, 15)
for k in K:
    clustering = KMeans(n_clusters=k)
    clustering = clustering.fit(factors_attributes)
    Sum_of_squared_distances.append(clustering.inertia_)

plt.subplot(2, 1, 1)
plt.plot(K, Sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method for Optimal K')
plt.show()

# The code below finds the best number of iterations for clustering.
# The graph shows that about 9 iterations are enough.

I = range(1, 50)
for i in I:
    clustering = KMeans(n_clusters=3, max_iter=i)
    clustering = clustering.fit(factors_attributes)
    Sum.append(clustering.inertia_)

plt.subplot(2, 2, 1)
plt.plot(I, Sum, 'bx-')
plt.xlabel('I')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method for optimal I')
plt.show()



# One color per cluster label (0, 1, 2)
colorMap = np.array(['red', 'lime', 'black'])

plt.subplot(2, 2, 2)
finalC = KMeans(n_clusters=3, max_iter=9)
finalC = finalC.fit(factors_attributes)
# x-axis: length, y-axis: rental_rate, colored by assigned cluster
plt.scatter(x=factors_attributes.length, y=factors_attributes.rental_rate, c=colorMap[finalC.labels_], s=50)

plt.tight_layout()
plt.show()

Answer 1

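In your plot, the film rental_rate is on the y-axis. In your data sample I can only see two distinct values (0.99 and 4.99), which give the two horizontal bars at the top and bottom. Presumably there are also films with a rental_rate of 2.99, which give the middle horizontal bar. So rental_rate has only three distinct values.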

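Your x-axis is the film length, which appears to be a continuous variable ranging between roughly 45 and 200. Together with language_id and rental_rate, you pass these three features to k-means and force the model to produce n_clusters=3 clusters. k-means then tries to split the data into three clusters (red, black, green), but the length variable seems to have by far the largest influence, because the clusters are separated by this feature alone. rental_rate has no (significant) effect, and language_id probably does not contribute to the model either.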

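I think you expected the films to be clustered by rental_rate, or at least not by length alone. That is not what happens with your data, because k-means uses a distance metric (Euclidean distance by default) as its optimization objective, so the absolute values of the features matter a great deal. Since length spans a much wider range of absolute values (~[45,200]) than the other features ([1,5] and [1,X]), it has a much larger influence on the clustering whenever the Euclidean distance between two samples is computed.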
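To make the effect of scale concrete, here is a small worked example. The three films below are made-up values, not rows from your data; they only illustrate how a gap in length outweighs the largest possible gap in rental_rate in a Euclidean distance.

```python
import math

# Hypothetical films as (rental_rate, length, language_id); values are invented.
film_a = (0.99, 60.0, 1.0)
film_b = (4.99, 60.0, 1.0)   # same length as film_a, largest rental_rate gap
film_c = (0.99, 120.0, 1.0)  # same rental_rate as film_a, one hour longer

dist_rate = math.dist(film_a, film_b)    # differs only in rental_rate
dist_length = math.dist(film_a, film_c)  # differs only in length

print(f"distance from rental_rate gap: {dist_rate:.2f}")   # 4.00
print(f"distance from length gap:      {dist_length:.2f}") # 60.00
```

Even the cheapest and the most expensive film are only 4 units apart, while an extra hour of running time adds 60 units, so the cluster boundaries end up following length.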

One possible solution is to normalize your data/features.
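A minimal sketch of that idea, assuming scikit-learn's StandardScaler is an acceptable way to normalize. The data frame below is synthetic (I don't have your homework_film data), language_id is left out for brevity, and random_state and n_init are set only to make the run repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the film data; the real values are not shown in the question.
rng = np.random.default_rng(0)
films = pd.DataFrame({
    "rental_rate": rng.choice([0.99, 2.99, 4.99], size=300),
    "length": rng.integers(45, 200, size=300).astype(float),
})

# Rescale each feature to zero mean and unit variance so that
# no feature dominates the distance just because of its range.
scaled = StandardScaler().fit_transform(films)

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scaled)
films["cluster"] = model.labels_

# Inspect how the clusters relate to rental_rate after scaling.
print(pd.crosstab(films["cluster"], films["rental_rate"]))
```

Whether the resulting clusters are meaningful is a separate question, but after scaling both features contribute on a comparable footing.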
