StandardScaler in Spark not working as expected
Any idea why Spark does this for StandardScaler? As per the definition of StandardScaler:
The StandardScaler standardizes a set of features to have zero mean
and a standard deviation of 1. The flag withStd will scale the data to
unit standard deviation while the flag withMean (false by default)
will center the data prior to scaling it.
>>> tmpdf.show(4)
+----+----+----+------------+
|int1|int2|int3|temp_feature|
+----+----+----+------------+
| 1| 2| 3| [2.0]|
| 7| 8| 9| [8.0]|
| 4| 5| 6| [5.0]|
+----+----+----+------------+
>>> sScaler = StandardScaler(withMean=True, withStd=True).setInputCol("temp_feature")
>>> sScaler.fit(tmpdf).transform(tmpdf).show()
+----+----+----+------------+-------------------------------------------+
|int1|int2|int3|temp_feature|StandardScaler_4fe08ca180ab163e4120__output|
+----+----+----+------------+-------------------------------------------+
| 1| 2| 3| [2.0]| [-1.0]|
| 7| 8| 9| [8.0]| [1.0]|
| 4| 5| 6| [5.0]| [0.0]|
+----+----+----+------------+-------------------------------------------+
In the numpy world:
>>> x
array([2., 8., 5.])
>>> (x - x.mean())/x.std()
array([-1.22474487, 1.22474487, 0. ])
In the sklearn world:
>>> scaler = StandardScaler(with_mean=True, with_std=True)
>>> data
[[2.0], [8.0], [5.0]]
>>> print(scaler.fit(data).transform(data))
[[-1.22474487]
[ 1.22474487]
[ 0. ]]
The reason that your results are not as expected is that pyspark.ml.feature.StandardScaler uses the unbiased sample standard deviation instead of the population standard deviation.
From the docs:
The “unit std” is computed using the corrected sample standard deviation, which is computed as the square root of the unbiased sample variance.
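To make the arithmetic concrete: the deviations from the mean (5) are -3, 3 and 0, so the unbiased sample variance is (9 + 9 + 0) / 2 = 9, the corrected sample standard deviation is 3, and dividing the deviations by 3 gives exactly the [-1.0], [1.0], [0.0] that Spark returns:
import numpy as np

x = np.array([2., 8., 5.])
# the corrected sample variance divides by n - 1 instead of n
sample_var = ((x - x.mean()) ** 2).sum() / (len(x) - 1)  # (9 + 9 + 0) / 2 = 9.0
print(np.sqrt(sample_var))                               # 3.0
print((x - x.mean()) / np.sqrt(sample_var))
#[-1.  1.  0.]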
If you try your numpy code using the sample standard deviation, you will see the same results:
import numpy as np
x = np.array([2., 8., 5.])
print((x - x.mean())/x.std(ddof=1))
#array([-1., 1., 0.])
From a modeling perspective, this is almost surely not a problem (unless your data is the entire population, which is pretty much never the case). Also keep in mind that for large sample sizes, the sample standard deviation approaches the population standard deviation, so if you have many rows in your DataFrame the difference here is negligible.
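As a quick illustration of the large-sample point (the distribution, size, and seed below are arbitrary choices for the sketch):
import numpy as np

big = np.random.RandomState(0).normal(size=100000)
print(big.std(ddof=0))  # population standard deviation
print(big.std(ddof=1))  # unbiased sample standard deviation
# the two agree to several decimal places; their ratio is sqrt(n / (n - 1))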
However, if you insist on having your scaler use the population standard deviation, one "hacky" way is to append a row to your DataFrame that holds the mean of each column.
Recall that the population standard deviation is defined as the square root of the sum of the squared differences from the mean, divided by the number of samples. Or, as a function:
# using the same x as above
def popstd(x):
    return np.sqrt(sum((xi - x.mean())**2/len(x) for xi in x))
print(popstd(x))
#2.4494897427831779
print(x.std())
#2.4494897427831779
The only difference when using the unbiased standard deviation is that you divide by len(x)-1 instead of len(x). So if you add a sample that is equal to the mean, you increase the denominator without impacting the overall mean.
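You can check the effect of this trick in plain numpy first: appending one extra sample equal to the mean leaves the mean unchanged, and the unbiased sample standard deviation of the padded array equals the population standard deviation of the original array.
import numpy as np

x = np.array([2., 8., 5.])
x_padded = np.append(x, x.mean())   # add one sample equal to the mean

print(x_padded.mean() == x.mean())  # True: the mean is unchanged
print(x.std(ddof=0))                # population std of the original data: 2.449...
print(x_padded.std(ddof=1))         # sample std of the padded data:       2.449...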
Suppose you had the following DataFrame:
df = spark.createDataFrame(
    np.array(range(1,10,1)).reshape(3,3).tolist(),
    ["int1", "int2", "int3"]
)
df.show()
#+----+----+----+
#|int1|int2|int3|
#+----+----+----+
#| 1| 2| 3|
#| 4| 5| 6|
#| 7| 8| 9|
#+----+----+----+
Union this DataFrame with the mean of each column:
import pyspark.sql.functions as f
# This is equivalent to UNION ALL in SQL
df2 = df.union(df.select(*[f.avg(c).alias(c) for c in df.columns]))
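At this point df2 should contain the column averages as an extra row. Note that the union with avg() promotes the integer columns to doubles, which is why the values show up as 1.0, 4.0, ... in the final output below. It should look something like:
df2.show()
#+----+----+----+
#|int1|int2|int3|
#+----+----+----+
#| 1.0| 2.0| 3.0|
#| 4.0| 5.0| 6.0|
#| 7.0| 8.0| 9.0|
#| 4.0| 5.0| 6.0|
#+----+----+----+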
Now scale your values:
from pyspark.ml.feature import VectorAssembler, StandardScaler
va = VectorAssembler(inputCols=["int2"], outputCol="temp_feature")
tmpdf = va.transform(df2)
sScaler = StandardScaler(
    withMean=True, withStd=True, inputCol="temp_feature", outputCol="scaled"
)
sScaler.fit(tmpdf).transform(tmpdf).show()
#+----+----+----+------------+---------------------+
#|int1|int2|int3|temp_feature|scaled |
#+----+----+----+------------+---------------------+
#|1.0 |2.0 |3.0 |[2.0] |[-1.2247448713915892]|
#|4.0 |5.0 |6.0 |[5.0] |[0.0] |
#|7.0 |8.0 |9.0 |[8.0] |[1.2247448713915892] |
#|4.0 |5.0 |6.0 |[5.0] |[0.0] |
#+----+----+----+------------+---------------------+