PCA output looks weird for a kmeans scatter plot

After running PCA on my data and plotting the k-means clusters, my plot looks really strange. The scatter of the points and the cluster centers make no sense to me. Here is my code:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# clicks, conversion, bounce and search are lists of values.
clicks=[2,0,0,8,7,...]
conversion = [1,0,0,6,0...]
bounce = [2,4,5,0,1....]

X = np.array([clicks,conversion, bounce]).T
y = np.array(search)

num_clusters = 5

pca=PCA(n_components=2, whiten=True)
data2D = pca.fit_transform(X)

print data2D
    >>> [[-0.07187948 -0.17784291]
     [-0.07173769 -0.26868727]
     [-0.07173789 -0.26867958]
     ..., 
     [-0.06942414 -0.25040886]
     [-0.06950897 -0.19591147]
     [-0.07172973 -0.2687937 ]]

km = KMeans(n_clusters=num_clusters, init='k-means++',n_init=10, verbose=1)
km.fit_transform(X)

labels=km.labels_
centers2D = pca.fit_transform(km.cluster_centers_)

colors=['#000000','#FFFFFF','#FF0000','#00FF00','#0000FF']
col_map=dict(zip(set(labels),colors))
label_color = [col_map[l] for l in labels]

plt.scatter( data2D[:,0], data2D[:,1], c=label_color)
plt.hold(True)
plt.scatter(centers2D[:,0], centers2D[:,1],  marker='x', c='r')
plt.show()

The red crosses are the cluster centers. Any help would be great.

Your ordering of PCA and KMeans is what is messing things up...

Here is what you need to do (a compact sketch follows this list):

  1. Normalize your data.
  2. Perform PCA on X to reduce the dimensions from 5 to 2 and produce Data2D.
  3. Normalize again.
  4. Cluster Data2D with KMeans.
  5. Plot the centroids on Data2D.
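
A compact sketch of that ordering (this block is mine, not the answerer's; MinMaxScaler simply stands in for the manual min-max normalization used in the full code further down, and the random X is a placeholder so the sketch runs on its own):

import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# X stands in for the raw (samples x features) matrix from the question
X = np.random.rand(100, 5)

# 1. normalize the raw features
X_norm = MinMaxScaler().fit_transform(X)

# 2. PCA down to 2 components
data2D = PCA(n_components=2).fit_transform(X_norm)

# 3. normalize again after the projection
data2D = MinMaxScaler().fit_transform(data2D)

# 4. cluster the 2D data, not the original X
km = KMeans(n_clusters=5, init='k-means++', n_init=10).fit(data2D)

# 5. the centroids now live in the same 2D space as data2D
centers2D = km.cluster_centers_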

Here is what you did above:

  1. Performed PCA on X to reduce the dimensions from 5 to 2, producing Data2D.
  2. Clustered the original data X in 5 dimensions.
  3. Ran a separate PCA on your cluster centroids, which generates a completely different 2D subspace for the centroids.
  4. Plotted the PCA-reduced Data2D with the PCA-reduced centroids on top of it, even though the two are no longer correctly coupled.

Normalization:

Look at the code below and you will see that it puts the centroids right where they need to be. Normalization is key, and it is completely reversible. Always normalize your data when you cluster, because the distance metric needs to move through all of the dimensions equally. Clustering is one of the most important times to normalize your data, but in general... always normalize :-)
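
To make the distance argument concrete, here is a tiny made-up two-feature example of mine (the ranges are loosely based on the describe() output further down): without scaling, the large-range feature completely dominates the Euclidean distance that KMeans minimizes.

import numpy as np

# two visitors that differ by 1 click but by 500 in visit length
a = np.array([1.0, 20.0])      # [clicks, visit_length]
b = np.array([2.0, 520.0])

# on the raw scales, the distance is essentially all visit_length
print(np.linalg.norm(a - b))   # ~500.0

# after min-max scaling each feature to [0, 1], both features contribute
lo = np.array([0.0, 1.0])      # assumed feature minima
hi = np.array([10.0, 2500.0])  # assumed feature maxima
a_s = (a - lo) / (hi - lo)
b_s = (b - lo) / (hi - lo)
print(np.linalg.norm(a_s - b_s))   # ~0.22, clicks now matters too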

A heuristic discussion that goes beyond your original question:

The whole point of dimensionality reduction is to make the KMeans clustering easier and to project out dimensions that do not add to the variance of the data. So you should pass the reduced data to your clustering algorithm. I will add that there are very few 5D datasets that can be projected down to 2D without throwing away a lot of variance, i.e. look at the PCA diagnostics to see whether 90% of the original variance has been preserved. If it has not, then you might not want to be so aggressive with your PCA.
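
A minimal sketch of that diagnostic (X_norm here stands for your normalized feature matrix, which is an assumption of the sketch; the attribute name is scikit-learn's, and the 90% threshold is just the rule of thumb above):

import numpy as np
from sklearn.decomposition import PCA

pca = PCA(n_components=2).fit(X_norm)   # X_norm: your normalized feature matrix
ratios = pca.explained_variance_ratio_

# cumulative variance retained by the first 1, 2, ... components
print(np.cumsum(ratios))

# be wary if the 2D projection keeps less than ~90% of the variance
if ratios.sum() < 0.90:
    print("only %.1f%% of the variance survives the 2D projection" % (100 * ratios.sum()))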

New code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import seaborn as sns
%matplotlib inline

# read your data, replace 'Whosebug.csv' with your file path
df = pd.read_csv('/Users/angus/Desktop/Downloads/Whosebug.csv', usecols=[0, 2, 4], names=['freq', 'visit_length', 'conversion_cnt'], header=0).dropna()

df.describe()

#Normalize the data
df_norm = (df - df.mean()) / (df.max() - df.min())

num_clusters = 5

pca=PCA(n_components=2)
UnNormdata2D = pca.fit_transform(df_norm)

# Check the resulting variance
var = pca.explained_variance_ratio_
print "Variance after PCA: ", var

#Normalize again following PCA: data2D
data2D = (UnNormdata2D - UnNormdata2D.mean()) / (UnNormdata2D.max()-UnNormdata2D.min())

print "Data2D: "
print data2D

km = KMeans(n_clusters=num_clusters, init='k-means++',n_init=10, verbose=1)
km.fit_transform(data2D)

labels=km.labels_
centers2D = km.cluster_centers_

colors=['#000000','#FFFFFF','#FF0000','#00FF00','#0000FF']
col_map=dict(zip(set(labels),colors))
label_color = [col_map[l] for l in labels]

plt.scatter( data2D[:,0], data2D[:,1], c=label_color)
plt.hold(True)
plt.scatter(centers2D[:,0], centers2D[:,1],marker='x',s=150.0,color='purple')
plt.show()

The plot:

Output:

Variance after PCA:  [ 0.65725709  0.29875307]
Data2D: 
[[-0.00338421 -0.0009403 ]
[-0.00512081 -0.00095038]
[-0.00512081 -0.00095038]
..., 
[-0.00477349 -0.00094836]
[-0.00373153 -0.00094232]
[-0.00512081 -0.00095038]]
Initialization complete
Iteration  0, inertia 51.225
Iteration  1, inertia 38.597
Iteration  2, inertia 36.837
...
...
Converged at iteration 31

Hope this helps!

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# read your data, replace 'Whosebug.csv' with your file path
df = pd.read_csv('Whosebug.csv', usecols=[0, 2, 4], names=['freq', 'visit_length', 'conversion_cnt'], header=0).dropna()
df.describe()

Out[3]: 
              freq  visit_length  conversion_cnt
count  289705.0000   289705.0000     289705.0000
mean        0.2624       20.7598          0.0748
std         0.4399       55.0571          0.2631
min         0.0000        1.0000          0.0000
25%         0.0000        6.0000          0.0000
50%         0.0000       10.0000          0.0000
75%         1.0000       21.0000          0.0000
max         1.0000     2500.0000          1.0000

# binarize freq and conversion_cnt
df.freq = np.where(df.freq > 1.0, 1, 0)
df.conversion_cnt = np.where(df.conversion_cnt > 0.0, 1, 0)

feature_names = df.columns
X_raw = df.values

transformer = PCA(n_components=2)
X_2d = transformer.fit_transform(X_raw)
# over 99.9% variance captured by 2d data
transformer.explained_variance_ratio_

Out[4]: array([  9.9991e-01,   6.6411e-05])

# do clustering
estimator = KMeans(n_clusters=5, init='k-means++', n_init=10, verbose=1)
estimator.fit(X_2d)

labels = estimator.labels_
colors = ['#000000','#FFFFFF','#FF0000','#00FF00','#0000FF']
col_map=dict(zip(set(labels),colors))
label_color = [col_map[l] for l in labels]

fig, ax = plt.subplots()
ax.scatter(X_2d[:,0], X_2d[:,1], c=label_color)
ax.scatter(estimator.cluster_centers_[:,0], estimator.cluster_centers_[:,1], marker='x', s=50, c='r')

KMeans tries to minimize within-group Euclidean distances, which may or may not be appropriate for your data. Based on the plot alone, I would consider a Gaussian Mixture Model for the unsupervised clustering.
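
If you want to try that, here is a minimal sketch with scikit-learn's GaussianMixture, reusing the X_2d matrix from the code above (the parameter choices are illustrative, not tuned):

from sklearn.mixture import GaussianMixture

# fit a 5-component mixture on the same 2D PCA projection
gmm = GaussianMixture(n_components=5, covariance_type='full', random_state=0)
gmm.fit(X_2d)
gmm_labels = gmm.predict(X_2d)

# the component means play the same role as the KMeans centroids
print(gmm.means_)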

Also, if you have prior knowledge about which observations can be classified into which category/label, you can do semi-supervised learning.
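
One way to do that in scikit-learn is LabelSpreading; a minimal sketch, again assuming the X_2d matrix from above and a handful of hand-labelled rows (the indices and labels below are purely hypothetical):

import numpy as np
from sklearn.semi_supervised import LabelSpreading

# -1 marks the unlabeled rows, which is the convention LabelSpreading expects
y_partial = -1 * np.ones(len(X_2d), dtype=int)

# hypothetical: suppose the first three rows were labelled by hand
y_partial[0] = 0
y_partial[1] = 1
y_partial[2] = 2

model = LabelSpreading(kernel='knn', n_neighbors=7)
model.fit(X_2d, y_partial)

# inferred labels for every row, including the originally unlabeled ones
inferred = model.transduction_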