K-means initialization with farthest-first traversal and k-means++

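I am confused about k-means++ initialization. My understanding is that k-means++ picks the data point farthest from the existing centers as the next center. But what about outliers? And what is the difference between farthest-first traversal initialization and k-means++?

I saw someone explain it like this: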

Here is a one-dimensional example. Our observations are [0, 1, 2, 3, 4]. Let the first center, c1, be 0. The probability that the next cluster center, c2, is x is proportional to ||c1-x||^2. So, P(c2 = 1) = 1a, P(c2 = 2) = 4a, P(c2 = 3) = 9a, P(c2 = 4) = 16a, where a = 1/(1+4+9+16).

Suppose c2=4. Then, P(c3 = 1) = 1a, P(c3 = 2) = 4a, P(c3 = 3) = 1a, where a = 1/(1+4+1).
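The arithmetic in the quoted example can be checked with a short sketch. It assumes the usual k-means++ rule that each point is weighted by its squared distance to the nearest center already chosen; the function name `kmeanspp_probabilities` is made up for this illustration and is not taken from any library.

```python
from fractions import Fraction


def kmeanspp_probabilities(data, centers):
    """Return {point: probability} for the next k-means++ center.

    Assumption: each point is weighted by the squared distance to its
    nearest already-chosen center (points that are centers get weight 0).
    """
    weights = {x: min((x - c) ** 2 for c in centers) for x in data}
    total = sum(weights.values())
    # Exact fractions make it easy to compare with the hand calculation.
    return {x: Fraction(w, total) for x, w in weights.items() if w > 0}


observations = [0, 1, 2, 3, 4]

# First center c1 = 0: weights 1, 4, 9, 16 out of a total of 30.
step2 = kmeanspp_probabilities(observations, centers=[0])

# Suppose c2 = 4 was drawn: weights 1, 4, 1 out of a total of 6.
step3 = kmeanspp_probabilities(observations, centers=[0, 4])

for point, prob in step3.items():
    print(point, prob)
```

Note that in the second step the point 2 is the most likely choice because it is the farthest from both existing centers, but 1 and 3 can still be drawn.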

What about an array or list like [0,1,2,4,5,6,100]? Clearly 100 is an outlier here, and at some point it will be chosen as a center. Can anyone give a better explanation?

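k-means++ chooses points with a probability; it does not deterministically pick the farthest point the way farthest-first traversal does.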

But yes, with an extreme outlier, it will very likely choose the outlier.
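To put a number on "very likely", here is a minimal sketch for the list from the question. It assumes the first center happens to be 0 (in k-means++ the first center is normally drawn uniformly at random) and that the selection weight is the squared distance to the nearest center. The helper names are hypothetical.

```python
def squared_distance_to_nearest(x, centers):
    """Squared distance from x to the closest already-chosen center."""
    return min((x - c) ** 2 for c in centers)


def kmeanspp_probability(data, centers, candidate):
    """Probability that k-means++ draws `candidate` as the next center."""
    total = sum(squared_distance_to_nearest(x, centers) for x in data)
    return squared_distance_to_nearest(candidate, centers) / total


def farthest_first_choice(data, centers):
    """Farthest-first traversal: always take the farthest point."""
    return max(data, key=lambda x: squared_distance_to_nearest(x, centers))


data = [0, 1, 2, 4, 5, 6, 100]
centers = [0]  # assumed first center, for illustration only

p_outlier = kmeanspp_probability(data, centers, candidate=100)
ff_pick = farthest_first_choice(data, centers)

print(f"k-means++ picks 100 with probability {p_outlier:.3f}")  # about 0.992
print(f"farthest-first always picks {ff_pick}")
```

Under these assumptions the two methods agree almost all of the time on this data. They differ on milder data such as [0, 1, 2, 3, 4], where farthest-first would always take 4 but k-means++ takes it with probability 16/30.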

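That is fine, because k-means itself will do the same. The best SSQ (sum of squares) solution most likely has a one-element cluster containing only that point.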
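This claim can be checked by brute force on the small list from the question. The sketch below assumes k = 2 (the thread does not fix k) and tries every split of the data into two non-empty clusters, scoring each by the sum of squared distances to the cluster mean.

```python
from itertools import combinations


def ssq(cluster):
    """Sum of squared distances from each point to the cluster mean."""
    mean = sum(cluster) / len(cluster)
    return sum((x - mean) ** 2 for x in cluster)


def best_two_clusters(data):
    """Exhaustively search all 2-cluster partitions (small inputs only)."""
    best = None
    n = len(data)
    for size in range(1, n):
        for idx in combinations(range(n), size):
            a = [data[i] for i in idx]
            b = [data[i] for i in range(n) if i not in idx]
            cost = ssq(a) + ssq(b)
            if best is None or cost < best[0]:
                best = (cost, a, b)
    return best


data = [0, 1, 2, 4, 5, 6, 100]
cost, cluster_a, cluster_b = best_two_clusters(data)
print(cost, cluster_a, cluster_b)
```

For this data the minimum is reached by putting 100 in a cluster of its own, with a total SSQ of 28. Splitting into {0, 1, 2} and {4, 5, 6, 100} instead gives 6772.75, so the outlier dominates the objective.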

If you have data like this, k-means solutions tend to be useless, and you should probably choose a different algorithm, such as DBSCAN.

cluster-analysis

machine-learning

k-means