Computing the distance of a vector from the center of a K-means cluster
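I have a training data set. I ran K-means on it with K=4 and got four cluster centers. For a new data point, I would like to know not only the predicted cluster but also the distance from that cluster's center. Is there an API to compute the Euclidean distance from the center? I can make 2 API calls if needed. I am using Scala and could not find an example anywhere.
The following worked for me: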
def EuclideanDistance(xs: Array[Double], ys: Array[Double]): Double = {
  // Square root of the sum of squared coordinate differences
  scala.math.sqrt((xs zip ys).map { case (x, y) => scala.math.pow(y - x, 2.0) }.sum)
}
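To get both the predicted cluster and the distance for a new point, a helper like the one above can be applied to every center and the minimum kept. The following is only a minimal sketch in plain Scala with no Spark dependency: `nearestCenter` is a hypothetical name, and the four centers are made-up values standing in for whatever the trained model returns.

```scala
// Euclidean distance between two equal-length vectors (same idea as the helper above).
def euclideanDistance(xs: Array[Double], ys: Array[Double]): Double = {
  require(xs.length == ys.length, "vectors must have the same length")
  math.sqrt(xs.zip(ys).map { case (x, y) => math.pow(y - x, 2.0) }.sum)
}

// Hypothetical helper: returns (index of the nearest center, Euclidean distance to it).
def nearestCenter(point: Array[Double], centers: Seq[Array[Double]]): (Int, Double) = {
  require(centers.nonEmpty, "at least one center is required")
  centers.zipWithIndex
    .map { case (center, index) => (index, euclideanDistance(point, center)) }
    .minBy(_._2)
}

// Made-up centers for K = 4; real ones would come from the trained model.
val centers = Seq(Array(0.0, 0.0), Array(3.0, 0.0), Array(0.0, 4.0), Array(10.0, 10.0))

val (cluster, distance) = nearestCenter(Array(3.0, 4.0), centers)
println(s"cluster=$cluster distance=$distance") // prints "cluster=2 distance=3.0"
```

This scans all K centers per point, which is fine for a handful of clusters and a few lookups; for a whole data set the model's own prediction is the better source of the cluster index.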
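Since Spark 2.0, Vectors.sqdist can be used to compute the squared distance between two vectors.
You can use a UDF to compute the distance of each point from its center, as follows: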
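// Assumes a SparkSession named `spark` is in scope, as in spark-shell;
// toDF and the $"..." column syntax need spark.implicits._
import spark.implicits._
import org.apache.spark.ml.linalg.{Vectors, Vector}
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.sql.functions.udf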
// Sample points
val points = Seq(Vectors.dense(1,0), Vectors.dense(2,-3), Vectors.dense(0.5, -1), Vectors.dense(1.5, -1.5))
val df = points.map(Tuple1.apply).toDF("features")
// K-means
val kmeans = new KMeans()
  .setFeaturesCol("features")
  .setK(2)
val kmeansModel = kmeans.fit(df)
val predictedDF = kmeansModel.transform(df)
// predictedDF.schema = (features: Vector, prediction: Int)
// Cluster Centers
kmeansModel.clusterCenters foreach println
/*
[1.75,-2.25]
[0.75,-0.5]
*/
// UDF that computes the squared distance of each point from its assigned cluster center
val distFromCenter = udf((features: Vector, c: Int) => Vectors.sqdist(features, kmeansModel.clusterCenters(c)))
val distancesDF = predictedDF.withColumn("distanceFromCenter", distFromCenter($"features", $"prediction"))
distancesDF.show(false)
/*
+----------+----------+------------------+
|features |prediction|distanceFromCenter|
+----------+----------+------------------+
|[1.0,0.0] |1 |0.3125 |
|[2.0,-3.0]|0 |0.625 |
|[0.5,-1.0]|1 |0.3125 |
|[1.5,-1.5]|0 |0.625 |
+----------+----------+------------------+
*/
Note: Vectors.sqdist computes the squared distance between two vectors (no square root is taken). If you need the Euclidean distance, you can use Math.sqrt(Vectors.sqdist(...)).
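Putting that note into practice, the square root can be taken inside the UDF itself. The sketch below is one possible arrangement, not the only one: it assumes Spark 2.x or later with spark-mllib on the classpath, creates its own local SparkSession for illustration (in spark-shell `spark` already exists), fixes a seed only to make runs repeatable, and uses the illustrative names `euclideanFromCenter` and `euclideanDistance`.

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Local session for illustration only; in spark-shell `spark` is already defined.
val spark = SparkSession.builder().master("local[1]").appName("kmeans-distance-sketch").getOrCreate()
import spark.implicits._

// Same sample points as in the answer above.
val points = Seq(
  Vectors.dense(1.0, 0.0),
  Vectors.dense(2.0, -3.0),
  Vectors.dense(0.5, -1.0),
  Vectors.dense(1.5, -1.5)
)
val df = points.map(Tuple1.apply).toDF("features")

val model = new KMeans().setFeaturesCol("features").setK(2).setSeed(1L).fit(df)

// Copy the centers into a local array so the UDF closure captures only the array.
val centers: Array[Vector] = model.clusterCenters

// Euclidean distance = square root of the squared distance returned by Vectors.sqdist.
val euclideanFromCenter = udf((features: Vector, cluster: Int) =>
  math.sqrt(Vectors.sqdist(features, centers(cluster))))

val predicted = model.transform(df)
val withDistance = predicted.withColumn(
  "euclideanDistance",
  euclideanFromCenter(col("features"), col("prediction")))

withDistance.show(false)
```

Cluster indices, and therefore which center each row is measured against, depend on initialization, so the printed values are not listed here.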