Plot KMeans clusters and classification for 1-dimensional data

Question

I am using KMeans to cluster three time-series datasets with different characteristics. For reproducibility, I am sharing the data here.

Here is my code:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

protocols = {}

types = {"data1": "data1.csv", "data2": "data2.csv", "data3": "data3.csv"}

for protname, fname in types.items():
    col_time,col_window = np.loadtxt(fname,delimiter=',').T
    trailing_window = col_window[:-1] # "past" values at a given index
    leading_window  = col_window[1:]  # "current" values at a given index
    decreasing_inds = np.where(leading_window < trailing_window)[0]
    quotient = leading_window[decreasing_inds]/trailing_window[decreasing_inds]
    quotient_times = col_time[decreasing_inds]

    protocols[protname] = {
        "col_time": col_time,
        "col_window": col_window,
        "quotient_times": quotient_times,
        "quotient": quotient,
    }



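# n_jobs, precompute_distances and algorithm='auto' are no longer accepted by
# recent scikit-learn releases, so they are dropped; the other arguments are unchanged.
k_means = KMeans(copy_x=True, init='k-means++', max_iter=300,
    n_clusters=3, n_init=10, random_state=0, tol=0.0001, verbose=0)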
k_means.fit(quotient.reshape(-1,1))

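The idea is that, given a new data point (quotient and quotient_times), I want to know which cluster it belongs to, by building each dataset from these two transformed features (quotient and quotient_times) and clustering it with KMeans.

k_means.labels_ gives this output: array([1, 1, 0, 1, 2, 1, 0, 0, 2, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 0], dtype=int32)

Finally, I want to visualize the clusters using plt.plot(k_means, ".",color="blue"), but I get this error: TypeError: float() argument must be a string or a number, not 'KMeans'. How do we plot the KMeans clusters?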

Answer 1

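If I understand correctly, what you want to plot is the decision boundary of your KMeans result. You can find an example of how to do this on the scikit-learn website here.

That example also applies PCA so that the data can be visualized in 2D (for data with more than 2 dimensions), which is not relevant in your case.

You can easily color the scatter points by the KMeans decision to get a better idea of where the clustering goes wrong.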
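As a minimal sketch of coloring points by the KMeans decision: the arrays below are synthetic stand-ins for quotient and quotient_times (assumed values, not the asker's data), and the colormap choice is arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Synthetic stand-ins for the question's arrays (assumed values, not the real data).
rng = np.random.default_rng(0)
quotient_times = np.arange(30, dtype=float)
quotient = np.concatenate([
    rng.normal(0.2, 0.02, 10),
    rng.normal(0.5, 0.02, 10),
    rng.normal(0.9, 0.02, 10),
])

# KMeans expects a 2-D array of shape (n_samples, n_features).
k_means = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = k_means.fit_predict(quotient.reshape(-1, 1))

# Pass the labels as the color argument so each cluster gets its own color.
fig, ax = plt.subplots()
ax.scatter(quotient_times, quotient, c=labels, cmap="viridis")
ax.set_xlabel("quotient_times")
ax.set_ylabel("quotient")
ax.set_title("Points colored by KMeans label")
```

Note that the original error came from passing the estimator object itself to plt.plot; what gets plotted here is the data, with the fitted labels used only for color.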

Answer 2

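What you are effectively looking for is a set of value ranges, where any point falling within a given range is considered to belong to a given class. It is quite unusual to use KMeans to classify 1d data in this way, although it certainly works. As you have noticed, you need to convert your input data to a 2d array to use the method.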

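# n_jobs, precompute_distances and algorithm='auto' are no longer accepted by
# recent scikit-learn releases, so they are dropped; the other arguments are unchanged.
k_means = KMeans(copy_x=True, init='k-means++', max_iter=300,
    n_clusters=3, n_init=10, random_state=0, tol=0.0001, verbose=0)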

quotient_2d = quotient.reshape(-1,1)
k_means.fit(quotient_2d)

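You will need quotient_2d again later for the classification (prediction) step.

First we can plot the centroids. Because the data is 1d, the x-axis position is arbitrary.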
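The "range of values" idea can be made explicit. In 1d, KMeans assigns each point to its nearest centroid, so the class boundaries are the midpoints between neighboring centroids. The sketch below uses made-up values in place of quotient; the names centers, boundaries and bins are introduced here for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up 1-D values standing in for `quotient` (not the asker's data).
quotient = np.array([0.10, 0.12, 0.15, 0.48, 0.50, 0.55, 0.90, 0.92, 0.95])
quotient_2d = quotient.reshape(-1, 1)  # shape (n_samples, 1)

k_means = KMeans(n_clusters=3, n_init=10, random_state=0).fit(quotient_2d)

# Sort the centroids, then take midpoints between neighbors as class boundaries.
centers = np.sort(k_means.cluster_centers_.ravel())
boundaries = (centers[:-1] + centers[1:]) / 2
print("centroids: ", centers)
print("boundaries:", boundaries)

# Bin index of each point: 0 = lowest range, 1 = middle range, 2 = highest range.
bins = np.digitize(quotient, boundaries)
print("range index per point:", bins)
```

The bin index is ordered by value, whereas the integer labels returned by KMeans are arbitrary, so the two agree only up to a relabeling.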

colors = ['r','g','b']
centroids = k_means.cluster_centers_
for n, y in enumerate(centroids):
    plt.plot(1, y, marker='x', color=colors[n], ms=10)
plt.title('Kmeans cluster centroids')

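This produces the following plot.

To get the cluster membership of the points, pass quotient_2d to .predict. This returns an array of numbers giving the class membership, e.g.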

>>> Z = k_means.predict(quotient_2d)
>>> Z
array([1, 1, 0, 1, 2, 1, 0, 0, 2, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 0], dtype=int32)
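The same call answers the question about a new data point. A minimal sketch, assuming made-up training values and a hypothetical new observation new_quotient: a single value still has to be passed as a 2d array with one row and one column.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up 1-D values standing in for `quotient` (not the asker's data).
quotient = np.array([0.10, 0.12, 0.15, 0.48, 0.50, 0.55, 0.90, 0.92, 0.95])
k_means = KMeans(n_clusters=3, n_init=10, random_state=0).fit(quotient.reshape(-1, 1))

new_quotient = 0.52  # hypothetical new observation
# predict expects shape (n_samples, n_features), here (1, 1).
new_label = k_means.predict(np.array([[new_quotient]]))[0]
nearest_center = k_means.cluster_centers_[new_label, 0]
print(f"new point {new_quotient} -> cluster {new_label} (centroid {nearest_center:.3f})")
```

Since the model was fitted on quotient alone, only that value determines the cluster; quotient_times plays no part unless it is stacked in as a second feature column before fitting.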

We can use this to filter our original data and plot each class in a different color.

# Plot each class as a separate colour
n_clusters = 3 
for n in range(n_clusters):
    # Filter data points to plot each in turn.
    ys = quotient[ Z==n ]
    xs = quotient_times[ Z==n ]

    plt.scatter(xs, ys, color=colors[n])

plt.title("Points by cluster")

This produces the following plot of the original data, with each point colored by its cluster membership.


python

machine-learning

matplotlib

k-means

scikit-learn