Spark - Window with recursion? - Conditionally propagating values across rows

Question

I have the following dataframe showing the revenue of purchases.

+-------+--------+-------+
|user_id|visit_id|revenue|
+-------+--------+-------+
|      1|       1|      0|
|      1|       2|      0|
|      1|       3|      0|
|      1|       4|    100|
|      1|       5|      0|
|      1|       6|      0|
|      1|       7|    200|
|      1|       8|      0|
|      1|       9|     10|
+-------+--------+-------+

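Ultimately I want a new column purch_revenue that shows, in every row, the revenue generated by the purchase that row leads up to. As a workaround, I have also tried to introduce a purchase identifier purch_id that is incremented each time a purchase is made, so it is listed here just for reference.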

+-------+--------+-------+-------------+--------+
|user_id|visit_id|revenue|purch_revenue|purch_id|
+-------+--------+-------+-------------+--------+
|      1|       1|      0|          100|       1|
|      1|       2|      0|          100|       1|
|      1|       3|      0|          100|       1|
|      1|       4|    100|          100|       1|
|      1|       5|      0|          200|       2|
|      1|       6|      0|          200|       2|
|      1|       7|    200|          200|       2|
|      1|       8|      0|           10|       3|
|      1|       9|     10|           10|       3|
+-------+--------+-------+-------------+--------+

I have tried using the lag/lead functions like this:

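from pyspark.sql import Window, functions as fn

user_timeline = Window.partitionBy("user_id").orderBy("visit_id")
# Use the row's own revenue if it is positive, otherwise look one row ahead.
find_rev = fn.when(fn.col("revenue") > 0, fn.col("revenue")) \
  .otherwise(fn.lead(fn.col("revenue"), 1).over(user_timeline))
df.withColumn("purch_revenue", find_rev)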

This copies the revenue column where revenue > 0 and also pulls it up by one row. Clearly, I could chain this for a finite N, but that is not a solution.

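Is there a way to apply this recursively until revenue > 0?
Alternatively, is there a way to increment a value based on a condition? I have tried to figure out a way to do that but struggled to find one.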

Answer 1

Window functions don't support recursion, but it is not required here. This kind of grouping can be handled easily with a cumulative sum:

from pyspark.sql.functions import col, sum, when, lag
from pyspark.sql.window import Window

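w = Window.partitionBy("user_id").orderBy("visit_id")
# Flag purchases, shift the flag down one row (default 0), then take a
# running sum so the id increments on the row after each purchase.
purch_id = sum(lag(when(
    col("revenue") > 0, 1).otherwise(0),
    1, 0
).over(w)).over(w) + 1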

df.withColumn("purch_id", purch_id).show()

+-------+--------+-------+--------+
|user_id|visit_id|revenue|purch_id|
+-------+--------+-------+--------+
|      1|       1|      0|       1|
|      1|       2|      0|       1|
|      1|       3|      0|       1|
|      1|       4|    100|       1|
|      1|       5|      0|       2|
|      1|       6|      0|       2|
|      1|       7|    200|       2|
|      1|       8|      0|       3|
|      1|       9|     10|       3|
+-------+--------+-------+--------+
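
To see why the lagged running sum produces these ids, and how purch_revenue could follow from them, here is a minimal plain-Python sketch of the same arithmetic for a single user. It is an illustration only, not Spark code: the list names are made up for this example, and it assumes the rows are already ordered by visit_id. Since each purch_id group contains at most one non-zero revenue, the group maximum is the purchase revenue. In Spark the equivalent last step would presumably be max("revenue") over a window partitioned by user_id and purch_id, but that part is untested here.

```python
from itertools import accumulate

# Revenue per visit for one user, ordered by visit_id (sample data from the question).
revenue = [0, 0, 0, 100, 0, 0, 200, 0, 10]

# 1 where a purchase happened, 0 otherwise.
# Mirrors: when(col("revenue") > 0, 1).otherwise(0)
is_purchase = [1 if r > 0 else 0 for r in revenue]

# Shift the flags down by one row, filling the first row with 0.
# Mirrors: lag(..., 1, 0).over(w)
lagged = [0] + is_purchase[:-1]

# Running total plus one, so the id increments on the row AFTER each purchase.
# Mirrors: sum(...).over(w) + 1
purch_id = [total + 1 for total in accumulate(lagged)]

# Each group holds at most one non-zero revenue, so the group maximum
# is the revenue of the purchase that closes the group.
group_max = {}
for pid, r in zip(purch_id, revenue):
    group_max[pid] = max(group_max.get(pid, 0), r)
purch_revenue = [group_max[pid] for pid in purch_id]

print(purch_id)       # [1, 1, 1, 1, 2, 2, 2, 3, 3]
print(purch_revenue)  # [100, 100, 100, 100, 200, 200, 200, 10, 10]
```

Note that trailing visits with no later purchase would form a group whose maximum is 0, which seems like a reasonable value for "no purchase yet".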
