Distance between cluster center and outliers in R

Question

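I have built a cluster model in R (based on kmeans) and want to classify outliers by finding the minimum distance between each outlier and the cluster centers. The data frames I want to use look like this: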

DF_OUTLIERS

[Product]  [Sales] [Usage]
1   100 1000   
2   200 2000  
3   300 3000  
4   200 4000   
5   100 5000

DF_CLUSTER

[Cluster] [Center_Sales] [Center_Usage]
1    120        1500  
2    220        2400 
3    150        3900    
4    140        4900

The target table should look like this:

[Product]   [Sales]     [Usage]     [Cluster] 
1       100     1000        ???
2       200     2000        ???
3       300     3000        ???
4       200     4000        ???
5       100     5000        ???

To calculate the distance, I want to use the standard Euclidean distance formula:

sqrt((Sales - Center_Sales)^2 + (Usage - Center_Usage)^2)

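My biggest problem is writing a function that finds, for each row, the minimum distance across all clusters without adding a new column per cluster to the target df. I guess this is easy for an experienced programmer, but I am an absolute beginner in R and have no idea how to approach it.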

Answer 1

The handy which.min function is useful in this situation: it returns the index of the smallest element of a vector.
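As a quick illustration of what which.min returns, here is a minimal sketch on a made-up vector (the values in `distance_demo` are hypothetical and not taken from the question's data):

```r
# Hypothetical distances from one point to three cluster centers
distance_demo <- c(7.5, 2.1, 9.0)

# which.min gives the position of the smallest value, not the value itself
nearest_demo <- which.min(distance_demo)

# With ties, the first minimum is returned
tie_demo <- which.min(c(3, 1, 1))
```

Here `nearest_demo` is 2 and `tie_demo` is also 2, so the result can be used directly as a row index into the table of cluster centers.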

outliers<-read.table(header=TRUE, text="Product  Sales Usage
1   100 1000   
2   200 2000  
3   300 3000  
4   200 4000   
5   100 5000")

clusters<-read.table(header=TRUE, text="Cluster Center_Sales Center_Usage
1    120        1500  
2    220        2400 
3    150        3900    
4    140        4900")

answer <- sapply(seq_len(nrow(outliers)), function(x) {
  # distance from outlier x to every cluster center (vectorized over clusters)
  distance <- sqrt((outliers$Sales[x] - clusters$Center_Sales)^2 +
                   (outliers$Usage[x] - clusters$Center_Usage)^2)
  # row index in clusters of the shortest distance
  which.min(distance)
})

answer
#[1] 1 2 2 3 4
# look up the cluster label, in case labels differ from row positions
outliers$Cluster <- clusters$Cluster[answer]

As long as the number of outliers and clusters is reasonable, this should perform adequately.
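If the row-by-row loop ever becomes a bottleneck, one possible alternative is to build the whole outlier-by-cluster distance matrix at once with outer. This is only a sketch under the assumption that the data look like the tables in the question; the names `d`, `nearest`, and the extra `Distance` column are illustrative and not part of the original answer.

```r
# Data from the question, rebuilt so the example is self-contained
outliers <- data.frame(Product = 1:5,
                       Sales = c(100, 200, 300, 200, 100),
                       Usage = c(1000, 2000, 3000, 4000, 5000))
clusters <- data.frame(Cluster = 1:4,
                       Center_Sales = c(120, 220, 150, 140),
                       Center_Usage = c(1500, 2400, 3900, 4900))

# d[i, j] = Euclidean distance from outlier i to cluster center j
d <- sqrt(outer(outliers$Sales, clusters$Center_Sales, "-")^2 +
          outer(outliers$Usage, clusters$Center_Usage, "-")^2)

# For each row (outlier), the column (cluster) with the smallest distance
nearest <- apply(d, 1, which.min)

outliers$Cluster <- clusters$Cluster[nearest]
# Optionally keep the distance to the assigned center as well
outliers$Distance <- d[cbind(seq_len(nrow(d)), nearest)]
```

This assigns the same clusters as the loop (1, 2, 2, 3, 4) and adds only the final Cluster column (plus the optional Distance) to the data frame, not one column per cluster. Note that Usage is on a much larger scale than Sales in this data, so it dominates the unscaled Euclidean distance.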

r

cluster-analysis