Group By and standardize in Spark
I have the following DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame([[1,2,3],[1,2,1],[1,2,2],[2,2,2],[2,3,2],[2,4,2]],columns=["a","b","c"])
df = df.set_index("a")
df.groupby("a").mean()
df.groupby("a").std()
I want to standardize the DataFrame within each key, not over the whole column vector.
So for the example above, the output would be:
a = 1:
Column: b
(2 - 2) / 0.0
(2 - 2) / 0.0
(2 - 2) / 0.0
Column: c
(3 - 2) / 1.0
(1 - 2) / 1.0
(2 - 2) / 1.0
Then I would standardize each value within each group (a small pandas sketch of what I mean follows).
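For illustration only (this sketch is not part of the original question), the intended per-group z-score can be written in pandas with groupby().transform; s.std() uses the sample standard deviation (ddof=1), matching df.groupby("a").std():

import pandas as pd

df = pd.DataFrame([[1,2,3],[1,2,1],[1,2,2],[2,2,2],[2,3,2],[2,4,2]], columns=["a","b","c"])

# Per-group z-score; constant groups give 0/0 = NaN here (division by a zero std)
zscores = df.groupby("a")[["b", "c"]].transform(lambda s: (s - s.mean()) / s.std())
print(zscores)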
How can I do this in Spark?
Thanks
With a Spark DataFrame:
sdf = spark.createDataFrame(df.reset_index())  # reset_index() so the key "a" is a column; createDataFrame drops the pandas index
Imports and a z-score helper:
from pyspark.sql.functions import col, mean, stddev
from pyspark.sql.window import Window

def z_score(c, w):
    # (value - group mean) / group sample standard deviation, computed over window w
    return (col(c) - mean(c).over(w)) / stddev(c).over(w)
Window:
w = Window.partitionBy("a")
Solution:
sdf.select("a", z_score("b", w).alias("a"), z_score("c", w).alias("b")).show()
+---+----+----+
|  a|   b|   c|
+---+----+----+
| 1|null| 1.0|
| 1|null|-1.0|
| 1|null| 0.0|
| 2|-1.0|null|
| 2| 0.0|null|
| 2| 1.0|null|
+---+----+----+
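The nulls appear wherever a group's standard deviation is 0, because division by zero returns null in Spark SQL. If you would rather get 0.0 for those constant groups, a minimal sketch (the guard and the z_score_safe name are my own additions, not part of the answer above):

from pyspark.sql.functions import col, mean, stddev, when

def z_score_safe(c, w):
    # Same z-score, but emit 0.0 when the group's stddev is 0 instead of null
    sd = stddev(c).over(w)
    return when(sd == 0, 0.0).otherwise((col(c) - mean(c).over(w)) / sd)

sdf.select("a", z_score_safe("b", w).alias("b"), z_score_safe("c", w).alias("c")).show()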