Spark: get the actual cluster centroids with StandardScaler
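I fitted a KMeans model on features scaled with StandardScaler. The problem is that the cluster centroids come out scaled as well. Is it possible to get the original centroids programmatically?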
import pandas as pd
import numpy as np
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StandardScaler, StandardScalerModel
from pyspark.ml.clustering import KMeans
from sklearn.datasets import load_iris

# iris data set
iris = load_iris()
iris_data = pd.DataFrame(iris['data'], columns=iris['feature_names'])
# sqlContext is assumed to be the one provided by the PySpark shell
iris_df = sqlContext.createDataFrame(iris_data)

# combine all columns into a single feature vector
assembler = VectorAssembler(
    inputCols=[x for x in iris_df.columns], outputCol='features')
data = assembler.transform(iris_df)

# scale to unit standard deviation, without centering
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures", withStd=True, withMean=False)
scalerModel = scaler.fit(data)
scaledData = scalerModel.transform(data).drop('features').withColumnRenamed('scaledFeatures', 'features')

# cluster the scaled features
kmeans = KMeans().setFeaturesCol("features").setPredictionCol("prediction").setK(3)
model = kmeans.fit(scaledData)
centers = model.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)
Here I would like to get the centroids in the original scale. The centroids returned are scaled:
[ 7.04524479 6.17347978 2.50588155 1.88127377]
[ 6.0454109 7.88294475 0.82973422 0.31972295]
[ 8.22013841 7.19671468 3.13005178 2.59685552]
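You used StandardScaler with withStd=True and withMean=False, so each feature was only divided by its standard deviation. To get back to the original space you have to multiply by the std vector: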
[cluster * scalerModel.std for cluster in model.clusterCenters()]
If withMean were True, you would use:
[cluster * scalerModel.std + scalerModel.mean
 for cluster in model.clusterCenters()]
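The two formulas above can be checked without a Spark cluster. The following is a minimal NumPy sketch, not the Spark code itself: the toy data, the fixed cluster labels and the helper centers_of are all hypothetical, and the standard deviation is computed with ddof=1 on the assumption that it should match the corrected sample standard deviation that Spark's StandardScaler uses. It relies on the fact that a centroid is the mean of its points, so scaling the points scales the centroid in the same way.

```python
import numpy as np

# Hypothetical toy data: 6 points, 2 features (not the iris data from the question).
X = np.array([[1.0, 10.0], [2.0, 12.0], [3.0, 14.0],
              [8.0, 30.0], [9.0, 32.0], [10.0, 34.0]])
# Assumed fixed cluster assignment, standing in for the KMeans predictions.
labels = np.array([0, 0, 0, 1, 1, 1])

mean = X.mean(axis=0)
std = X.std(axis=0, ddof=1)  # corrected sample standard deviation (assumption)


def centers_of(data):
    """Centroid of each cluster: the mean of the points assigned to it."""
    return np.array([data[labels == k].mean(axis=0) for k in (0, 1)])


original_centers = centers_of(X)

# Case 1: withStd=True, withMean=False -> multiply by std.
scaled_centers = centers_of(X / std)
restored = scaled_centers * std

# Case 2: withStd=True, withMean=True -> multiply by std, then add mean.
centered_centers = centers_of((X - mean) / std)
restored_centered = centered_centers * std + mean

print(np.allclose(restored, original_centers))
print(np.allclose(restored_centered, original_centers))
```

Both checks should print True, which shows that undoing the scaling on the centroids gives the same result as computing the centroids on the unscaled data.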