Pyspark get predecessor value

I have a dataset similar to this one:

exp pid mat pskey order
1 CR P 1-CR-P 1
1 M C 1-M-C 2
1 CR C 1-CR-C 3
1 PP C 1-PP-C 4
2 CR P 2-CR-P 1
2 CR P 2-CR-P 1
2 M C 2-M-C 2
2 CR C 2-CR-C 3
2 CR C 2-CR-C 3
2 CR C 2-CR-C 3
2 CR C 2-CR-C 3
2 CR C 2-CR-C 3
2 PP C 2-PP-C 4
2 PP C 2-PP-C 4
2 PP C 2-PP-C 4
2 PP C 2-PP-C 4
2 PP C 2-PP-C 4
3 M C 3-M-C 2
4 CR P 4-CR-P 1
4 M C 4-M-C 2
4 CR C 4-CR-C 3
4 PP C 4-PP-C 4

What I need is to get the pskey of the predecessors within the same exp, given the following relationships:

order 1 -> no predecessor

order 2 -> no predecessor

order 3 -> [1, 2]

order 4 -> [3]

and to add those values to a new column named predecessor.
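The rules above can be sketched in plain Python before reaching for Spark. This is a hypothetical helper (not part of the question's code) that hard-codes the order -> predecessor-orders mapping described above and resolves it per exp group:

```python
# Which order values count as predecessors of each order value,
# per the rules stated above.
PRED_ORDERS = {1: [], 2: [], 3: [1, 2], 4: [3]}

def predecessors(rows):
    """rows: list of (exp, pskey, order) tuples.
    Returns a dict mapping pskey -> sorted list of predecessor pskeys
    within the same exp, or None when there is no predecessor."""
    result = {}
    for exp, pskey, order in rows:
        preds = sorted({
            pk for e, pk, o in rows
            if e == exp and o in PRED_ORDERS.get(order, [])
        })
        result[pskey] = preds or None
    return result

rows = [
    (1, "1-CR-P", 1),
    (1, "1-M-C", 2),
    (1, "1-CR-C", 3),
    (1, "1-PP-C", 4),
]
print(predecessors(rows))
# {'1-CR-P': None, '1-M-C': None,
#  '1-CR-C': ['1-CR-P', '1-M-C'], '1-PP-C': ['1-CR-C']}
```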

The expected result looks like this:

+---+---+---+------+-----+----------------------------------------+
|exp|pid|mat|pskey |order|predecessor                             |
+---+---+---+------+-----+----------------------------------------+
|1  |CR |P  |1-CR-P|1    |null                                    |
|1  |M  |C  |1-M-C |2    |null                                    |
|1  |CR |C  |1-CR-C|3    |[1-CR-P, 1-M-C ]                        |
|1  |PP |C  |1-PP-C|4    |[1-CR-C]                                |
|3  |M  |C  |3-M-C |2    |null                                    |
|2  |CR |P  |2-CR-P|1    |null                                    |
|2  |CR |P  |2-CR-P|1    |null                                    |
|2  |M  |C  |2-M-C |2    |null                                    |
|2  |CR |C  |2-CR-C|3    |[2-CR-P, 2-M-C]                         |
|2  |CR |C  |2-CR-C|3    |[2-CR-P, 2-M-C]                         |
|2  |CR |C  |2-CR-C|3    |[2-CR-P, 2-M-C]                         |
|2  |CR |C  |2-CR-C|3    |[2-CR-P, 2-M-C]                         |
|2  |CR |C  |2-CR-C|3    |[2-CR-P, 2-M-C]                         |
|2  |PP |C  |2-PP-C|4    |[2-CR-C]                                |
|2  |PP |C  |2-PP-C|4    |[2-CR-C]                                |
|2  |PP |C  |2-PP-C|4    |[2-CR-C]                                |
|2  |PP |C  |2-PP-C|4    |[2-CR-C]                                |
|2  |PP |C  |2-PP-C|4    |[2-CR-C]                                |
|4  |CR |P  |4-CR-P|1    |null                                    |
|4  |M  |C  |4-M-C |2    |null                                    |
|4  |CR |C  |4-CR-C|3    |[4-CR-P, 4-M-C]                         |
|4  |PP |C  |4-PP-C|4    |[4-CR-C]                                |
+---+---+---+------+-----+----------------------------------------+

I'm quite new to pyspark, so I don't know how to handle this.

The different cases on order are handled with when. You aggregate the values with collect_set to get the unique identifiers:

from pyspark.sql import functions as F, Window 

df2 = df.withColumn(
    "predecessor",
    F.when(
        # order 3: predecessors are orders 1 and 2, i.e. the frame [-2, -1]
        F.col("order") == 3,
        F.collect_set(F.col("pskey")).over(
            Window.partitionBy("exp").orderBy("order").rangeBetween(-2, -1)
        ),
    ).when(
        # order 4: the only predecessor is order 3, i.e. the frame [-1, -1]
        F.col("order") == 4,
        F.collect_set(F.col("pskey")).over(
            Window.partitionBy("exp").orderBy("order").rangeBetween(-1, -1)
        ),
    ),  # orders 1 and 2 match no branch and fall through to null
)
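Note that rangeBetween works on the *values* of the orderBy column, not on row positions: for each row, the frame contains every row in the same exp partition whose order lies in [order + lower, order + upper]. A small pure-Python simulation (a sketch, not Spark itself) of collect_set over such a frame:

```python
def collect_set_over_range(rows, exp, order, lo, hi):
    """Distinct pskey values from rows in the same exp partition whose
    order value falls in [order + lo, order + hi] — a plain-Python
    simulation of collect_set over a rangeBetween(lo, hi) frame."""
    return sorted({
        pk for e, pk, o in rows
        if e == exp and order + lo <= o <= order + hi
    })

rows = [(2, "2-CR-P", 1), (2, "2-M-C", 2), (2, "2-CR-C", 3), (2, "2-PP-C", 4)]
# For order == 3, rangeBetween(-2, -1) covers order values 1 and 2:
print(collect_set_over_range(rows, 2, 3, -2, -1))  # ['2-CR-P', '2-M-C']
# For order == 4, rangeBetween(-1, -1) covers only order value 3:
print(collect_set_over_range(rows, 2, 4, -1, -1))  # ['2-CR-C']
```

Because the frame is value-based, the duplicated rows in exp 2 do not inflate the result: collect_set keeps each pskey once.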

Result:

df2.show(truncate=False)
+---+---+---+------+-----+----------------+                                     
|exp|pid|mat|pskey |order|predecessor     |
+---+---+---+------+-----+----------------+
|1  |CR |P  |1-CR-P|1    |null            |
|1  |M  |C  |1-M-C |2    |null            |
|1  |CR |C  |1-CR-C|3    |[1-CR-P, 1-M-C ]|
|1  |PP |C  |1-PP-C|4    |[1-CR-C]        |
|3  |M  |C  |3-M-C |2    |null            |
|2  |CR |P  |2-CR-P|1    |null            |
|2  |CR |P  |2-CR-P|1    |null            |
|2  |M  |C  |2-M-C |2    |null            |
|2  |CR |C  |2-CR-C|3    |[2-CR-P, 2-M-C ]|
|2  |CR |C  |2-CR-C|3    |[2-CR-P, 2-M-C ]|
|2  |CR |C  |2-CR-C|3    |[2-CR-P, 2-M-C ]|
|2  |CR |C  |2-CR-C|3    |[2-CR-P, 2-M-C ]|
|2  |CR |C  |2-CR-C|3    |[2-CR-P, 2-M-C ]|
|2  |PP |C  |2-PP-C|4    |[2-CR-C]        |
|2  |PP |C  |2-PP-C|4    |[2-CR-C]        |
|2  |PP |C  |2-PP-C|4    |[2-CR-C]        |
|2  |PP |C  |2-PP-C|4    |[2-CR-C]        |
|2  |PP |C  |2-PP-C|4    |[2-CR-C]        |
|4  |CR |P  |4-CR-P|1    |null            |
|4  |M  |C  |4-M-C |2    |null            |
+---+---+---+------+-----+----------------+
only showing top 20 rows