Not able to get the average and standard deviation by multiple rows in PySpark
root
|-- cores: long (nullable = true)
|-- time0: double (nullable = true)
|-- time1: double (nullable = true)
|-- time2: double (nullable = true)
+-----+------------------+------------------+-----------------+
|cores|time0 |time1 |time2 |
+-----+------------------+------------------+-----------------+
|1 |26.362340927124023|25.891045093536377|26.19786810874939|
|2 |28.445404767990112|32.81148290634155 |30.37511706352234|
|4 |29.17068886756897 |28.47817611694336 |29.78126311302185|
+-----+------------------+------------------+-----------------+
I want the resulting dataframe to also include mean and stddev columns.
df_mean_stddev = df_cores.withColumn('*', F.mean(array(df_cores.columns[1:])).alias('mean')) \
    .withColumn(stddev(array(df_cores.columns[1:])).alias('stddev'))
df_mean_stddev.printSchema()
df_cores.show(truncate=False)
I tried the above, but I am getting errors. None of the examples I found that reference multiple columns per row in an aggregation seem to work for me. I am new to PySpark.
mean and stddev can compute the mean and standard deviation of a column, but these functions do not operate across a row.
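For contrast, a minimal sketch of the column-wise usage these functions are built for, aggregating one column over all rows (assuming the question's dataframe):

from pyspark.sql import functions as F

# One aggregate per column, computed down the rows
df.select(F.mean("time0"), F.stddev("time0")).show()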
One way to compute the values per row is to create a UDF and then use standard Python methods. But since the dataset has only three columns, the formula can also be written directly in SQL:
from pyspark.sql import functions as F

# Mean of the three time columns, then the sample standard deviation (n - 1 = 2)
df.withColumn("mean", F.expr("(time0 + time1 + time2)/3")) \
.withColumn("stddev", F.expr("sqrt((pow((time0-mean),2)+pow((time1-mean),2)+pow((time2-mean),2))/2)")) \
.show()
This prints:
+-----+------------------+------------------+-----------------+------------------+-------------------+
|cores| time0| time1| time2| mean| stddev|
+-----+------------------+------------------+-----------------+------------------+-------------------+
| 1|26.362340927124023|25.891045093536377|26.19786810874939|26.150418043136597|0.23920403891711864|
| 2|28.445404767990112| 32.81148290634155|30.37511706352234|30.544001579284668| 2.1879330570873967|
| 4| 29.17068886756897| 28.47817611694336|29.78126311302185|29.143376032511394| 0.6519727164969239|
+-----+------------------+------------------+-----------------+------------------+-------------------+
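The UDF route mentioned above scales to any number of time columns without editing the SQL expression. A minimal sketch, assuming Python's statistics module (the row_mean and row_stddev helpers are hypothetical names, and statistics.stdev uses the same n - 1 denominator as the formula above):

from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType
import statistics

# Hypothetical row-wise UDFs: each call receives one value per time column
row_mean = F.udf(lambda *xs: statistics.mean(xs), DoubleType())
row_stddev = F.udf(lambda *xs: statistics.stdev(xs), DoubleType())

time_cols = [F.col(c) for c in df.columns[1:]]  # skip the cores column
df.withColumn("mean", row_mean(*time_cols)) \
.withColumn("stddev", row_stddev(*time_cols)) \
.show()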