In Hive, how to number log records in time order per session while collapsing consecutive repeated pages

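In Hive, my log data contains consecutive repeated pages. I want to split the records by session, order them, and keep only the first timestamp of each run of repeated pages.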

ID          Page            Timestamp
Orestes     Login           152356
Orestes     Login           152360
Orestes     Account view    152368
Orestes     Account view    152372
Orestes     Transfer        152380
Orestes     Account view    152382
Orestes     Account view    152390
Orestes     Loan            152393
Antigone    Login           152382
Antigone    Transfer        152390
Antigone    Account view    152392
Antigone    Account view    152395
Antigone    Trust           152399

I would like to transform it into the following.

ID          Page            Timestamp   Sequence
Orestes     Login           152356      1
Orestes     Account view    152368      2
Orestes     Transfer        152380      3
Orestes     Account view    152382      4
Orestes     Loan            152393      5
Antigone    Login           152382      1
Antigone    Transfer        152390      2
Antigone    Account view    152392      3
Antigone    Trust           152399      4

The table script is...

insert into log values('Orestes','Login',152356);
insert into log values('Orestes','Login',152360);
insert into log values('Orestes','Account view',152368);
insert into log values('Orestes','Account view',152372);
insert into log values('Orestes','Transfer',152380);
insert into log values('Orestes','Account view',152382);
insert into log values('Orestes','Account view',152390);
insert into log values('Orestes','Loan',152393);
insert into log values('Antigone','Login',152382);
insert into log values('Antigone','Transfer',152390);
insert into log values('Antigone','Account view',152392);
insert into log values('Antigone','Account view',152395);
insert into log values('Antigone','Trust',152399);

For this task, I tried:

With cte as
(
Select id, page, min(timestamp) timestamp from log group by id, page
)
Select id, page, timestamp, rank() over (partition by id order by timestamp) from cte

However, in this case one of Orestes's Account view rows is lost. How can I fix this?

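GROUP BY id, page merges every occurrence of a page for a user into a single row, so Orestes's second run of Account view disappears. Instead, use LAG to get the previous page for each row and filter out rows whose page repeats the previous one.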

with log as (
select stack (13,
'Orestes','Login',           152356,
'Orestes','Login',           152360,
'Orestes','Account view',    152368,
'Orestes','Account view',    152372,
'Orestes','Transfer',        152380,
'Orestes','Account view',    152382,
'Orestes','Account view',    152390,
'Orestes','Loan',            152393,
'Antigone','Login',           152382,
'Antigone','Transfer',        152390,
'Antigone','Account view',    152392,
'Antigone','Account view',    152395,
'Antigone','Trust',           152399
) as (ID,Page,Timestamp) 
)

select id, page, timestamp, row_number() over(partition by id order by timestamp) sequence
from
(
select id, page, timestamp, lag(page) over(partition by id order by timestamp) prev_page
  from log
)s 
where (prev_page!=page) or (prev_page is null)
;

Result:

OK
Antigone        Login   152382  1
Antigone        Transfer        152390  2
Antigone        Account view    152392  3
Antigone        Trust   152399  4
Orestes Login   152356  1
Orestes Account view    152368  2
Orestes Transfer        152380  3
Orestes Account view    152382  4
Orestes Loan    152393  5
Time taken: 9.359 seconds, Fetched: 9 row(s)
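
To make the filter easier to follow, here is a rough sketch of just the intermediate step for Orestes: each row next to its prev_page, plus a flag showing whether the row starts a new run. The column name ts (used instead of timestamp to avoid possible keyword clashes in some Hive versions) and the flag name is_first_of_run are my own choices for illustration, not part of the original table.

```sql
-- Sample data for one user only; ts and is_first_of_run are illustrative names.
with log as (
select stack (8,
'Orestes','Login',           152356,
'Orestes','Login',           152360,
'Orestes','Account view',    152368,
'Orestes','Account view',    152372,
'Orestes','Transfer',        152380,
'Orestes','Account view',    152382,
'Orestes','Account view',    152390,
'Orestes','Loan',            152393
) as (id, page, ts)
)

select id, page, ts, prev_page,
       -- true when the row is the first of a run of identical pages
       (prev_page is null or prev_page != page) as is_first_of_run
from
(
-- lag(page) returns the page of the previous row for the same id, or NULL on the first row
select id, page, ts, lag(page) over(partition by id order by ts) prev_page
  from log
)s
order by ts
;

-- Rows expected from the sample data above:
-- Orestes  Login         152356  NULL          true
-- Orestes  Login         152360  Login         false
-- Orestes  Account view  152368  Login         true
-- Orestes  Account view  152372  Account view  false
-- Orestes  Transfer      152380  Account view  true
-- Orestes  Account view  152382  Transfer      true
-- Orestes  Account view  152390  Account view  false
-- Orestes  Loan          152393  Account view  true
```

The five rows flagged true are the ones the WHERE clause keeps, and row_number() then numbers them 1 to 5. Because the comparison is only against the immediately preceding row, the second run of Account view (after Transfer) survives, which is what GROUP BY could not do.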