按 pyspark 中的时差将行拆分为多个会话

split row into multiple sessions by time difference in pyspark

这是伪数据:

user  ts
--------
1     1
1     3
1     10
1     13
1     21
1     24

如果每个用户的相邻时间差>=6,它将被分成两个会话。所以,上面的数据应该拆分如下:

user    ts    diff
-------------------
1       1     None
1       3     2
-------------------
1       10    7
1       13    3
-------------------
1       21    8
1       24    3

我了解如何通过如下所示的 Window 函数在 pyspark 中生成 diff 列,但我如何才能以 pyspark 的方式为每个用户将其拆分为不同的会话?非常感谢!

select
   user,
   ts,
   (lag(ts, 1) over (partition by user order by ts asc)) as diff
from user_event

你有一个正确的开始。 SQL 将继续为:

select user, ts, diff,
       sum(case when diff > 6 then 1 else 0 end) over (partition by user order by ts) as session_grouping
from (select user, ts,
             lag(ts, 1) over (partition by user order by ts asc) as diff
      from user_event
     ) ue;