
Dividing a set of columns by its average in Pyspark

I need to divide a set of columns in a pyspark.sql.dataframe by their respective column averages, but I cannot find the right approach. Below are the sample data and my current code.

Input data

columns = ["Col1", "Col2", "Col3","Name"]
data = [("1","40", "56" , "john jones"),
    ("2", "45", "55", "tracey smith"),
    ("3", "33", "23", "amy sanders")]

df = spark.createDataFrame(data=data,schema=columns)

Col1    Col2    Col3    Name
 1      40      56  john jones
 2      45      55  tracey smith
 3      33      23  amy sanders

Expected output

Col1    Col2    Col3    Name
 0.5    1.02    1.25    john jones
 1      1.14    1.23    tracey smith
 1.5    0.84    0.51    amy sanders
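Each expected value is the original entry divided by its column average (e.g. Col1 averages 2, so 1/2 = 0.5). As a quick sanity check of those averages against the df defined above:

from pyspark.sql import functions as F

# Averages implied by the expected output:
# Col1 -> 2.0, Col2 -> ~39.33, Col3 -> ~44.67
df.select([F.avg(c).alias(f"avg_{c}") for c in ["Col1", "Col2", "Col3"]]).show()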

My function so far, which does not work:

#function to divide few columns by the column average and overwrite the column

def avg_scaling(df):

  #List of columns which have to be scaled by their average
  col_list = ['col1', 'col2', 'col3']

  for i in col_list:
   df = df.withcolumn(i, col(i)/df.select(f.avg(df[i])))

   return df

new_df = avg_scaling(df)

You can use a Window here, partitioned on a pseudo-column (a literal), and take the average over that window. (Your attempt fails because df.select(f.avg(df[i])) returns a DataFrame, not a column, so it cannot be used as a division operand; withcolumn should be withColumn; and the return sits inside the loop, so at most one column would ever be processed.)

The code looks like this:

columns = ["Col1", "Col2", "Col3","Name"]
data = [("1","40", "56" , "john jones"),
    ("2", "45", "55", "tracey smith"),
    ("3", "33", "23", "amy sanders")]

df = spark.createDataFrame(data=data,schema=columns)

df.show()

+----+----+----+------------+
|Col1|Col2|Col3|        Name|
+----+----+----+------------+
|   1|  40|  56|  john jones|
|   2|  45|  55|tracey smith|
|   3|  33|  23| amy sanders|
+----+----+----+------------+


from pyspark.sql import Window
from pyspark.sql import functions as F

def avg_scaling(df, cols_to_scale):
    # Average over a window spanning the whole frame (partitioned on a literal),
    # then overwrite each column with value / average, rounded to 2 decimals.
    w = Window.partitionBy(F.lit(1))
    for col in cols_to_scale:
        df = df.withColumn(col, F.round(F.col(col) / F.avg(col).over(w), 2))
    return df

new_df = avg_scaling(df, ["Col1", "Col2", "Col3"])
new_df.show()


+----+----+----+------------+
|Col1|Col2|Col3|        Name|
+----+----+----+------------+
| 0.5|1.02|1.25|  john jones|
| 1.5|0.84|0.51| amy sanders|
| 1.0|1.14|1.23|tracey smith|
+----+----+----+------------+
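
If pulling every row into a single window partition is a concern on larger data, here is a minimal alternative sketch (the helper name avg_scaling_agg is mine): aggregate the averages once, broadcast them onto each row with a cross join, then divide.

from pyspark.sql import functions as F

def avg_scaling_agg(df, cols_to_scale):
    # One aggregation produces every column average as avg_<col>.
    avgs = df.agg(*[F.avg(c).alias(f"avg_{c}") for c in cols_to_scale])
    # Attach the single-row averages to every row; broadcast keeps the join cheap.
    df = df.crossJoin(F.broadcast(avgs))
    # Divide each column by its average and drop the helper columns.
    for c in cols_to_scale:
        df = df.withColumn(c, F.round(F.col(c) / F.col(f"avg_{c}"), 2)).drop(f"avg_{c}")
    return df

avg_scaling_agg(df, ["Col1", "Col2", "Col3"]).show()

This gives the same rounded result as the Window version.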