Can there be overlap in k-means clusters?

Question

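I don't understand why k-means clusters can overlap. In Chen et al. (2018) I found the following definition:

"...let the observations be a sample set, to be partitioned into K disjoint clusters"

However, my plots show overlap, and I am not sure why.

For reference, I am trying to cluster a multidimensional dataset with three variables (Recency, Frequency, Revenue). To visualize the clusters, I use PCA to project the 3D data to 2D and then run k-means on the projection. Below are the code and the plots I get: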

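import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# tx_user is my DataFrame of per-user metrics (not shown here)
df1 = tx_user[["Recency", "Frequency", "Revenue"]]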
#standardize
names = df1.columns
# Create the Scaler object
scaler = preprocessing.StandardScaler()
# Fit your data on the scaler object
scaled_df1 = scaler.fit_transform(df1)
df1 = pd.DataFrame(scaled_df1, columns=names)
df1.head()
del scaled_df1

sklearn_pca = PCA(n_components = 2)
X1 = sklearn_pca.fit_transform(df1)
X1 = X1[:, ::-1] # flip axes for better plotting
kmeans = KMeans(3, random_state=0)
labels = kmeans.fit(X1).predict(X1)
plt.scatter(X1[:, 0], X1[:, 1], c=labels, s=40, cmap='viridis');

from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist

def plot_kmeans(kmeans, X, n_clusters=4, rseed=0, ax=None):
    labels = kmeans.fit_predict(X)

    # plot the input data
    ax = ax or plt.gca()
    ax.axis('equal')
    #ax.set_ylim(-5000,7000)
    ax.scatter(X[:, 0], X[:, 1], c=labels, s=40, cmap='viridis', zorder=2)

    # plot the representation of the KMeans model
    centers = kmeans.cluster_centers_
    radii = [cdist(X[labels == i], [center]).max()
             for i, center in enumerate(centers)]
    for c, r in zip(centers, radii):
        ax.add_patch(plt.Circle(c, r, fc='#CCCCCC', lw=3, alpha=0.5, zorder=1))

kmeans = KMeans(n_clusters=4, random_state=0)
plot_kmeans(kmeans, X1)

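My questions are:
1. Why is there overlap? If there is, is my clustering wrong?
2. If there is overlap, how does k-means decide the cluster assignment?

Thanks

Reference: Chen, L., Xu, Z., Wang, H., & Liu, S. (2018). An ordered clustering algorithm based on K-means and the PROMETHEE method. International Journal of Machine Learning and Cybernetics, 9(6), 917-926.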

Answer 1

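K-means approximates k clusters by averaging. Each cluster is defined by its computed center (the mean of its members), so it is unique by definition.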

Each sample is assigned to the cluster whose center is nearest, which is also unique by definition. In this sense there is no overlap.

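However, for a given distance d > 0, a sample may lie within distance d of more than one cluster center. This is what you are seeing when you say "overlap": the circles in your second plot have a radius equal to the distance from each center to its farthest member, so a point can fall inside several circles at once. The sample is still assigned only to the nearest cluster, not to every cluster whose circle covers it. So the clusters themselves do not overlap.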
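To make the distinction concrete, here is a minimal NumPy sketch. The points and centers are made up for illustration (they are not the asker's data), and the circle radius follows the same rule as the plotting helper in the question: the distance from a center to its farthest member.

```python
import numpy as np

# Hypothetical toy data: two vertically stretched groups in 2D.
X = np.array([[0.0, -4.0], [0.0, 0.0], [0.0, 4.0], [2.0, 0.0],
              [5.0, -4.0], [5.0, 0.0], [5.0, 4.0], [3.0, 0.0]])
# Assumed centers; each one equals the mean of the points assigned to it.
centers = np.array([[0.5, 0.0], [4.5, 0.0]])

# Distance of every point to every center, shape (n_points, n_centers).
dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)

# Nearest-center rule: exactly one label per point.
labels = dists.argmin(axis=1)

# Circle radius per cluster = distance from the center to its farthest member.
radii = np.array([dists[labels == k, k].max() for k in range(len(centers))])

# A point is "inside" a circle if it is no farther than that circle's radius.
inside = dists <= radii
in_both = inside.all(axis=1)

# Two circles overlap when the centers are closer than the sum of the radii.
circles_overlap = np.linalg.norm(centers[0] - centers[1]) < radii.sum()

print("labels:", labels)
print("circles overlap:", circles_overlap)
print("points inside both circles:", int(in_both.sum()))
```

The circles overlap and some points sit inside both, yet `labels` still holds a single cluster index for every point.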

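Note: if a sample is at exactly the same nearest distance from several cluster centers, it can be assigned to any one of those nearest clusters. This does not change anything important about the algorithm or the result, because the cluster centers are recomputed after the assignment.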
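As a small illustration of one possible tie-breaking rule, the sketch below uses `np.argmin`, which returns the first index when several entries share the minimum. This is only an example of a deterministic choice; it makes no claim about how any particular k-means library resolves ties. The centers and the sample are hypothetical.

```python
import numpy as np

# Hypothetical centers and a sample that is exactly equidistant from both.
centers = np.array([[0.0, 0.0], [2.0, 0.0]])
x = np.array([1.0, 3.0])

d = np.linalg.norm(centers - x, axis=1)  # distance to each center
tied = np.isclose(d[0], d[1])            # True: both distances are sqrt(10)

# np.argmin returns the first occurrence of the minimum on ties.
label = int(np.argmin(d))

print("distances:", d, "tied:", tied, "assigned to cluster:", label)
```

Even with a tie, the sample ends up in exactly one cluster.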

Answer 2

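K-means is an iterative algorithm that tries to partition the dataset into K predefined, distinct, non-overlapping subgroups (clusters), where each data point belongs to exactly one group. It tries to make the data points within a cluster as similar as possible while keeping the clusters as different (far apart) as possible. It assigns data points to clusters so that the sum of squared distances between the data points and their cluster centroid (the arithmetic mean of all data points in that cluster) is minimized. The less variation there is within a cluster, the more homogeneous (similar) the data points in that cluster are.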
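The iteration described above can be sketched in plain NumPy. This is a simplified Lloyd-style loop for illustration, not scikit-learn's implementation; the function name `lloyd_kmeans`, its parameters, and the random test data are all assumptions made for this example.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=20, seed=0):
    """Simplified k-means loop: assign to nearest center, then re-average."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    history = []  # sum of squared distances after each assignment step
    for _ in range(n_iter):
        # Squared distance of every point to every center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)  # exactly one cluster per point
        history.append(d2[np.arange(len(X)), labels].sum())
        # Move each center to the mean of its members.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old center
                centers[j] = members.mean(axis=0)
    # Final assignment against the final centers.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    return labels, centers, history

# Made-up data: two Gaussian blobs.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(5.0, 1.0, size=(50, 2))])
labels, centers, history = lloyd_kmeans(X, k=2)
print("within-cluster sum of squares per iteration:", np.round(history, 2))
```

Each assignment step and each averaging step can only keep or lower the within-cluster sum of squares, so `history` never increases, and every point ends with a single label.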

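Maybe you did something wrong... I don't have your data, so I can't test it. You can draw the cluster boundaries and check them. See the example code below.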

import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial import Voronoi

def voronoi_finite_polygons_2d(vor, radius=None):
    """
    Reconstruct infinite voronoi regions in a 2D diagram to finite
    regions.

    Parameters
    ----------
    vor : Voronoi
        Input diagram
    radius : float, optional
        Distance to 'points at infinity'.

    Returns
    -------
    regions : list of tuples
        Indices of vertices in each revised Voronoi regions.
    vertices : list of tuples
        Coordinates for revised Voronoi vertices. Same as coordinates
        of input vertices, with 'points at infinity' appended to the
        end.

    """

    if vor.points.shape[1] != 2:
        raise ValueError("Requires 2D input")

    new_regions = []
    new_vertices = vor.vertices.tolist()

    center = vor.points.mean(axis=0)
    if radius is None:
        radius = np.ptp(vor.points) * 2  # ndarray.ptp() was removed in NumPy 2.0

    # Construct a map containing all ridges for a given point
    all_ridges = {}
    for (p1, p2), (v1, v2) in zip(vor.ridge_points, vor.ridge_vertices):
        all_ridges.setdefault(p1, []).append((p2, v1, v2))
        all_ridges.setdefault(p2, []).append((p1, v1, v2))

    # Reconstruct infinite regions
    for p1, region in enumerate(vor.point_region):
        vertices = vor.regions[region]

        if all([v >= 0 for v in vertices]):
            # finite region
            new_regions.append(vertices)
            continue

        # reconstruct a non-finite region
        ridges = all_ridges[p1]
        new_region = [v for v in vertices if v >= 0]

        for p2, v1, v2 in ridges:
            if v2 < 0:
                v1, v2 = v2, v1
            if v1 >= 0:
                # finite ridge: already in the region
                continue

            # Compute the missing endpoint of an infinite ridge

            t = vor.points[p2] - vor.points[p1] # tangent
            t /= np.linalg.norm(t)
            n = np.array([-t[1], t[0]])  # normal

            midpoint = vor.points[[p1, p2]].mean(axis=0)
            direction = np.sign(np.dot(midpoint - center, n)) * n
            far_point = vor.vertices[v2] + direction * radius

            new_region.append(len(new_vertices))
            new_vertices.append(far_point.tolist())

        # sort region counterclockwise
        vs = np.asarray([new_vertices[v] for v in new_region])
        c = vs.mean(axis=0)
        angles = np.arctan2(vs[:,1] - c[1], vs[:,0] - c[0])
        new_region = np.array(new_region)[np.argsort(angles)]

        # finish
        new_regions.append(new_region.tolist())

    return new_regions, np.asarray(new_vertices)

# make up data points
np.random.seed(1234)
points = np.random.rand(15, 2)

# compute Voronoi tessellation
vor = Voronoi(points)

# plot
regions, vertices = voronoi_finite_polygons_2d(vor)
print("--")
print(regions)
print("--")
print(vertices)

# colorize
for region in regions:
    polygon = vertices[region]
    plt.fill(*zip(*polygon), alpha=0.4)

plt.plot(points[:,0], points[:,1], 'ko')
plt.axis('equal')
plt.xlim(vor.min_bound[0] - 0.1, vor.max_bound[0] + 0.1)
plt.ylim(vor.min_bound[1] - 0.1, vor.max_bound[1] + 0.1)
plt.show()
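The Voronoi example above uses random points as sites. To check boundaries for k-means specifically, the sites would be the cluster centers. A lighter alternative, sketched below under assumed centers (hypothetical values, not fitted to any data), is to label every cell of a grid by its nearest center; the resulting regions are the k-means cells and cannot overlap.

```python
import numpy as np

# Hypothetical cluster centers; in practice these would come from a fitted
# model, e.g. the cluster_centers_ attribute of a fitted KMeans object.
centers = np.array([[0.0, 0.0], [3.0, 1.0], [1.0, 4.0]])

# Regular grid covering the area of interest.
xs, ys = np.meshgrid(np.linspace(-2, 5, 50), np.linspace(-2, 6, 60))
grid = np.column_stack([xs.ravel(), ys.ravel()])

# Label each grid cell by its nearest center.
d = np.linalg.norm(grid[:, None, :] - centers[None, :, :], axis=2)
region = d.argmin(axis=1).reshape(xs.shape)

print("grid shape:", region.shape)
print("cells per region:", np.bincount(region.ravel()))
```

Passing `xs`, `ys`, and `region` to `plt.contourf` underneath the scatter plot would show the boundaries; each grid cell carries exactly one region index.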

There is a good resource here:

https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html

cluster-analysis

k-means