Spark: get the actual cluster centroids with StandardScaler

I fitted KMeans on features scaled with StandardScaler. The problem is that the cluster centroids are scaled as well. Is it possible to get the original centroids back programmatically?

import pandas as pd
import numpy as np
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StandardScaler, StandardScalerModel
from pyspark.ml.clustering import KMeans

from sklearn.datasets import load_iris

# iris data set
iris = load_iris()
iris_data = pd.DataFrame(iris['data'], columns=iris['feature_names'])

# `sqlContext` is assumed to already exist, e.g. from the pyspark shell
iris_df = sqlContext.createDataFrame(iris_data)

assembler = VectorAssembler(
    inputCols=[x for x in iris_df.columns],outputCol='features')

data = assembler.transform(iris_df)

# scale each feature to unit standard deviation (no mean centering)
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures", withStd=True, withMean=False)
scalerModel = scaler.fit(data)
# replace the raw feature vector with the scaled one
scaledData = scalerModel.transform(data).drop('features').withColumnRenamed('scaledFeatures', 'features')

kmeans = KMeans().setFeaturesCol("features").setPredictionCol("prediction").setK(3)
model = kmeans.fit(scaledData)
centers = model.clusterCenters()

print("Cluster Centers: ")
for center in centers:
    print(center)

Here I would like to get the centroids on the original scale, but they come back scaled:

[ 7.04524479  6.17347978  2.50588155  1.88127377]
[ 6.0454109   7.88294475  0.82973422  0.31972295]
[ 8.22013841  7.19671468  3.13005178  2.59685552]

The StandardScaler was fitted with withStd=True and withMean=False. To get back to the original space you have to multiply by the std vector:

[cluster * scalerModel.std for cluster in model.clusterCenters()]
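
Since scalerModel.std is a DenseVector while model.clusterCenters() returns NumPy arrays, it can be slightly safer to convert the vector explicitly with toArray(); a minimal sketch reusing the scalerModel and model objects from the question:

import numpy as np

# convert the std DenseVector to a NumPy array and undo the per-feature scaling
std = scalerModel.std.toArray()
original_centers = [center * std for center in model.clusterCenters()]

print("Cluster Centers (original scale):")
for center in original_centers:
    print(center)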

If withMean was True, you would use:

[cluster * scalerModel.std + scalerModel.mean 
    for cluster in model.clusterCenters()]
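
As a sanity check (a sketch that assumes the scaledData DataFrame from the question, which still carries the four raw iris columns), the rescaled centroids should match the per-cluster means of the original features, since with withMean=False every scaled point is just x / std:

# assign each row to a cluster, then average the raw (unscaled) iris columns;
# the per-cluster means should agree with the rescaled centroids above
predictions = model.transform(scaledData)
raw_cols = list(iris_data.columns)   # the four original measurement columns
predictions.groupBy('prediction').avg(*raw_cols).show(truncate=False)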