按 pyspark 中的时差将行拆分为多个会话
split row into multiple sessions by time difference in pyspark
这是伪数据:
user ts
--------
1 1
1 3
1 10
1 13
1 21
1 24
如果每个用户的相邻时间差>=6,它将被分成两个会话。所以,上面的数据应该拆分如下:
user ts diff
-------------------
1 1 None
1 3 2
-------------------
1 10 7
1 13 3
-------------------
1 21 8
1 24 3
我了解如何通过如下所示的 Window 函数在 pyspark
中生成 diff 列,但我如何才能以 pyspark
的方式为每个用户将其拆分为不同的会话?非常感谢!
select
user,
ts,
(lag(ts, 1) over (partition by user order by ts asc)) as diff
from user_event
你有一个正确的开始。 SQL 将继续为:
select user, ts, diff,
sum(case when diff > 6 then 1 else 0 end) over (partition by user order by ts) as session_grouping
from (select user, ts,
lag(ts, 1) over (partition by user order by ts asc) as diff
from user_event
) ue;
这是伪数据:
user ts
--------
1 1
1 3
1 10
1 13
1 21
1 24
如果每个用户的相邻时间差>=6,它将被分成两个会话。所以,上面的数据应该拆分如下:
user ts diff
-------------------
1 1 None
1 3 2
-------------------
1 10 7
1 13 3
-------------------
1 21 8
1 24 3
我了解如何通过如下所示的 Window 函数在 pyspark
中生成 diff 列,但我如何才能以 pyspark
的方式为每个用户将其拆分为不同的会话?非常感谢!
select
user,
ts,
(lag(ts, 1) over (partition by user order by ts asc)) as diff
from user_event
你有一个正确的开始。 SQL 将继续为:
select user, ts, diff,
sum(case when diff > 6 then 1 else 0 end) over (partition by user order by ts) as session_grouping
from (select user, ts,
lag(ts, 1) over (partition by user order by ts asc) as diff
from user_event
) ue;