PySpark: get predecessor value
I have a dataset similar to this one:
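exp | pid | mat | pskey | order |
---|---|---|---|---|
1 | CR | P | 1-CR-P | 1 |
1 | M | C | 1-M-C | 2 |
1 | CR | C | 1-CR-C | 3 |
1 | PP | C | 1-PP-C | 4 |
2 | CR | P | 2-CR-P | 1 |
2 | CR | P | 2-CR-P | 1 |
2 | M | C | 2-M-C | 2 |
2 | CR | C | 2-CR-C | 3 |
2 | CR | C | 2-CR-C | 3 |
2 | CR | C | 2-CR-C | 3 |
2 | CR | C | 2-CR-C | 3 |
2 | CR | C | 2-CR-C | 3 |
2 | PP | C | 2-PP-C | 4 |
2 | PP | C | 2-PP-C | 4 |
2 | PP | C | 2-PP-C | 4 |
2 | PP | C | 2-PP-C | 4 |
2 | PP | C | 2-PP-C | 4 |
3 | M | C | 3-M-C | 2 |
4 | CR | P | 4-CR-P | 1 |
4 | M | C | 4-M-C | 2 |
4 | CR | C | 4-CR-C | 3 |
4 | PP | C | 4-PP-C | 4 |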
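For anyone who wants to reproduce this, the sample rows above could be loaded roughly as follows. This is only a sketch: the names COLUMNS, STEPS, COUNTS, ROWS and build_df are made up for illustration, and it assumes an existing SparkSession is passed in as spark.

```python
# Column names taken from the sample table.
COLUMNS = ["exp", "pid", "mat", "pskey", "order"]

# In the sample data each order value always carries the same (pid, mat) pair.
STEPS = {1: ("CR", "P"), 2: ("M", "C"), 3: ("CR", "C"), 4: ("PP", "C")}

# How many times each order value appears for each exp in the sample table.
COUNTS = {
    1: {1: 1, 2: 1, 3: 1, 4: 1},
    2: {1: 2, 2: 1, 3: 5, 4: 5},
    3: {2: 1},
    4: {1: 1, 2: 1, 3: 1, 4: 1},
}

# Expand the counts into one tuple per table row; pskey is "<exp>-<pid>-<mat>".
ROWS = [
    (exp, pid, mat, f"{exp}-{pid}-{mat}", order)
    for exp, per_order in COUNTS.items()
    for order, n in per_order.items()
    for pid, mat in [STEPS[order]]
    for _ in range(n)
]


def build_df(spark):
    """Create the sample DataFrame from an existing SparkSession (assumed)."""
    return spark.createDataFrame(ROWS, schema=COLUMNS)
```

With a running session this would be called as df = build_df(spark).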
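What I need is to get the pskey values of the predecessors within the same exp, given the following relationships:
order 1 -> no predecessor
order 2 -> no predecessor
order 3 -> [1, 2]
order 4 -> [3]
These values should go into a new column named predecessor.
The expected result is: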
+---+---+---+------+-----+----------------------------------------+
|exp|pid|mat|pskey |order|predecessor                             |
+---+---+---+------+-----+----------------------------------------+
|1  |CR |P  |1-CR-P|1    |null                                    |
|1  |M  |C  |1-M-C |2    |null                                    |
|1  |CR |C  |1-CR-C|3    |[1-CR-P, 1-M-C ]                        |
|1  |PP |C  |1-PP-C|4    |[1-CR-C]                                |
|3  |M  |C  |3-M-C |2    |null                                    |
|2  |CR |P  |2-CR-P|1    |null                                    |
|2  |CR |P  |2-CR-P|1    |null                                    |
|2  |M  |C  |2-M-C |2    |null                                    |
|2  |CR |C  |2-CR-C|3    |[2-CR-P, 2-M-C]                         |
|2  |CR |C  |2-CR-C|3    |[2-CR-P, 2-M-C]                         |
|2  |CR |C  |2-CR-C|3    |[2-CR-P, 2-M-C]                         |
|2  |CR |C  |2-CR-C|3    |[2-CR-P, 2-M-C]                         |
|2  |CR |C  |2-CR-C|3    |[2-CR-P, 2-M-C]                         |
|2  |PP |C  |2-PP-C|4    |[2-CR-C]                                |
|2  |PP |C  |2-PP-C|4    |[2-CR-C]                                |
|2  |PP |C  |2-PP-C|4    |[2-CR-C]                                |
|2  |PP |C  |2-PP-C|4    |[2-CR-C]                                |
|2  |PP |C  |2-PP-C|4    |[2-CR-C]                                |
|4  |CR |P  |4-CR-P|1    |null                                    |
|4  |M  |C  |4-M-C |2    |null                                    |
|4  |CR |C  |4-CR-C|3    |[4-CR-P, 4-M-C]                         |
|4  |PP |C  |4-PP-C|4    |[4-CR-C]                                |
+---+---+---+------+-----+----------------------------------------+
I am very new to PySpark, so I don't know how to approach this.
The different cases on order are handled with when. Use collect_set to aggregate the values so that you get unique identifiers:
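from pyspark.sql import functions as F, Window

# Rows with order 1 or 2 match no branch, so predecessor stays null.
df2 = df.withColumn(
    "predecessor",
    F.when(
        F.col("order") == 3,
        # Frame covers order values 1 and 2 within the same exp.
        F.collect_set(F.col("pskey")).over(
            Window.partitionBy("exp").orderBy("order").rangeBetween(-2, -1)
        ),
    ).when(
        F.col("order") == 4,
        # Frame covers only order value 3 within the same exp.
        F.collect_set(F.col("pskey")).over(
            Window.partitionBy("exp").orderBy("order").rangeBetween(-1, -1)
        ),
    ),
)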
Result:
df2.show(truncate=False)
+---+---+---+------+-----+----------------+
|exp|pid|mat|pskey |order|predecessor     |
+---+---+---+------+-----+----------------+
|1  |CR |P  |1-CR-P|1    |null            |
|1  |M  |C  |1-M-C |2    |null            |
|1  |CR |C  |1-CR-C|3    |[1-CR-P, 1-M-C ]|
|1  |PP |C  |1-PP-C|4    |[1-CR-C]        |
|3  |M  |C  |3-M-C |2    |null            |
|2  |CR |P  |2-CR-P|1    |null            |
|2  |CR |P  |2-CR-P|1    |null            |
|2  |M  |C  |2-M-C |2    |null            |
|2  |CR |C  |2-CR-C|3    |[2-CR-P, 2-M-C ]|
|2  |CR |C  |2-CR-C|3    |[2-CR-P, 2-M-C ]|
|2  |CR |C  |2-CR-C|3    |[2-CR-P, 2-M-C ]|
|2  |CR |C  |2-CR-C|3    |[2-CR-P, 2-M-C ]|
|2  |CR |C  |2-CR-C|3    |[2-CR-P, 2-M-C ]|
|2  |PP |C  |2-PP-C|4    |[2-CR-C]        |
|2  |PP |C  |2-PP-C|4    |[2-CR-C]        |
|2  |PP |C  |2-PP-C|4    |[2-CR-C]        |
|2  |PP |C  |2-PP-C|4    |[2-CR-C]        |
|2  |PP |C  |2-PP-C|4    |[2-CR-C]        |
|4  |CR |P  |4-CR-P|1    |null            |
|4  |M  |C  |4-M-C |2    |null            |
+---+---+---+------+-----+----------------+
only showing top 20 rows
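To see why those two frames pick the right rows, here is a plain-Python model of what the windows compute (no Spark involved). rangeBetween works on the values of the order column rather than on row positions, so (-2, -1) evaluated at order 3 means "rows of the same exp whose order is 1 or 2". The function name predecessors and the frames dictionary are hypothetical, introduced only for this illustration.

```python
from collections import defaultdict


def predecessors(rows, frames):
    """Model of the when/collect_set/rangeBetween logic in plain Python.

    rows:   list of (exp, pskey, order) tuples
    frames: {order: (low, high)} value offsets, as passed to rangeBetween
    """
    # Distinct pskeys per (exp, order): partitionBy("exp") plus set semantics.
    keys_by_order = defaultdict(set)
    for exp, pskey, order in rows:
        keys_by_order[(exp, order)].add(pskey)

    result = []
    for exp, pskey, order in rows:
        if order not in frames:
            # No when() branch matches, which corresponds to null in Spark.
            result.append((exp, pskey, order, None))
            continue
        low, high = frames[order]
        found = set()
        # Value-based frame: every order value in [order + low, order + high].
        for other in range(order + low, order + high + 1):
            found |= keys_by_order.get((exp, other), set())
        # Sorted only to make this model deterministic; collect_set itself
        # does not guarantee any element order.
        result.append((exp, pskey, order, sorted(found)))
    return result


sample = [
    (2, "2-CR-P", 1),
    (2, "2-CR-P", 1),
    (2, "2-M-C", 2),
    (2, "2-CR-C", 3),
    (2, "2-PP-C", 4),
]
for row in predecessors(sample, {3: (-2, -1), 4: (-1, -1)}):
    print(row)
```

The duplicated 2-CR-P row shows up only once in the list for order 3, which is the reason for choosing collect_set over collect_list.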