In a Spark DataFrame, I want to groupBy, then orderBy, and then concatenate rows of another column

Question

I have a Spark DataFrame with the following columns:

ID: a numeric id, not unique
Date: a date timestamp
Name: a string

I want to first groupBy("ID"), then orderBy("Date"), and then concatenate the names.

So this DataFrame:

ID  Date          Name
1   01-02-2019    x
1   04-02-2019    z
2   05-03-2019    b
1   03-02-2019    y
2   02-03-2019    a

should be transformed into this:

ID  Name_concat
1   x,y,z
2   a,b

Please provide the Spark Scala syntax to do this.

The code below concatenates the strings for each id, but it does not preserve the order.

df.orderBy("id","date").groupBy("id").agg(concat_ws(", ", collect_list($"name")).as("all_name"))

Answer 1

df.show
+---+----------+---+
| id|      Date|  v|
+---+----------+---+
|  1|2019-02-01|  x|
|  1|2019-02-04|  z|
|  2|2019-05-03|  a|
|  1|2019-02-03|  y|
|  2|2019-05-02|  b|
|  2|2019-05-06|  c|
+---+----------+---+


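import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, collect_list, last}

val window = Window.partitionBy(col("id")).orderBy(col("Date"))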

df.withColumn("test",collect_list("v").over(window)).groupBy("id").agg(last("test")).show

+---+-----------------+
| id|last(test, false)|
+---+-----------------+
|  1|        [x, y, z]|
|  2|        [b, a, c]|
+---+-----------------+
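One caveat with the approach above: Spark documents `last` as non-deterministic after a shuffle, so picking the fully accumulated list that way is not strictly guaranteed. A possible alternative, sketched below, collects (Date, Name) structs per ID, sorts them with `sort_array`, and then joins the names with `concat_ws`. The object name `OrderedConcat`, the helpers `sampleData` and `concatInDateOrder`, and the assumption that the question's dates are `dd-MM-yyyy` strings are mine, not from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, concat_ws, sort_array, struct, to_date}

object OrderedConcat {

  // Hypothetical helper: rebuilds the question's sample rows.
  // Assumes the dates are dd-MM-yyyy strings and parses them into DateType.
  def sampleData(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      (1, "01-02-2019", "x"),
      (1, "04-02-2019", "z"),
      (2, "05-03-2019", "b"),
      (1, "03-02-2019", "y"),
      (2, "02-03-2019", "a")
    ).toDF("ID", "Date", "Name")
      .withColumn("Date", to_date(col("Date"), "dd-MM-yyyy"))
  }

  // Hypothetical helper: one row per ID with the names joined in Date order.
  def concatInDateOrder(df: DataFrame): DataFrame =
    df.groupBy("ID")
      // Structs sort field by field, so putting Date first orders each array by date.
      .agg(sort_array(collect_list(struct(col("Date"), col("Name")))).as("pairs"))
      // Selecting a field from an array of structs yields an array of that field.
      .select(col("ID"), concat_ws(",", col("pairs.Name")).as("Name_concat"))
      .orderBy("ID")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("ordered-concat")
      .getOrCreate()

    concatInDateOrder(sampleData(spark)).show()

    spark.stop()
  }
}
```

Because the sort happens inside each collected array, the result does not depend on the order in which rows reach the aggregation. If two rows share the same ID and Date, the tie is broken by Name, since that is the next struct field.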


scala

group-by

string-concatenation

apache-spark