Pivot in PySpark SQL

Question

I need to pivot the table below.

id,week,score
102,1,96
101,1,138
102,1,37
101,1,59
101,2,282
102,2,212
102,2,78
101,2,97
102,3,60
102,3,123
101,3,220
101,3,87

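Expected output:

id    1        2        3
101   138,59   282,97   220,87
102   96,37    212,78   123,60

The scores in each cell need to be sorted (in descending order, as in the output above).

I tried the code below, but it only works when there is a single record for a given id in each week: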

df.groupBy("id").pivot("week").agg(first("score"))

Answer 1

You should use collect_list instead of first to collect all the values; it returns the result as a list.

import org.apache.spark.sql.functions._

df.groupBy("id").pivot("week").agg(collect_list("score")).show()

Output:

+---+---------+---------+---------+
|id |1        |2        |3        |
+---+---------+---------+---------+
|101|[138, 59]|[282, 97]|[220, 87]|
|102|[96, 37] |[212, 78]|[60, 123]|
+---+---------+---------+---------+

Answer 2

The PySpark (Python) equivalent of the answer posted by Prasad Khode is as follows:

from pyspark.sql import functions as F
df.groupBy("id").pivot("week").agg(F.collect_list("score")).show()

If you look at the API documentation, you will see:

collect_list(Column e)
Aggregate function: returns a list of objects with duplicates.
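
To make the difference from first concrete, here is a minimal plain-Python sketch (no Spark required) of what each pivot cell ends up holding. The names `cells`, `collected` and `first_only` are illustrative only, and the sketch keeps values in input order, whereas Spark does not guarantee the order of a collect_list result after a shuffle.

```python
from collections import defaultdict

# Sample rows from the question: (id, week, score)
rows = [
    (102, 1, 96), (101, 1, 138), (102, 1, 37), (101, 1, 59),
    (101, 2, 282), (102, 2, 212), (102, 2, 78), (101, 2, 97),
    (102, 3, 60), (102, 3, 123), (101, 3, 220), (101, 3, 87),
]

# Group the scores by (id, week), keeping every value.
cells = defaultdict(list)
for id_, week, score in rows:
    cells[(id_, week)].append(score)

# Roughly what collect_list yields per cell: all values, duplicates included.
collected = dict(cells)

# Roughly what first yields per cell: a single value, the rest are dropped.
first_only = {key: values[0] for key, values in cells.items()}

print(collected[(102, 1)])   # [96, 37]
print(first_only[(102, 1)])  # 96
```

This is why first appears to work only when each (id, week) pair has a single record: with more than one record, the remaining values are silently discarded.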

You can also use collect_set, which gives the same output with duplicates removed.

df.groupBy("id").pivot("week").agg(F.collect_set("score")).show()

The API documentation says:

collect_set(Column e)
Aggregate function: returns a set of objects with duplicate elements eliminated.
