In Hive, how to order log rows by time within each session and number the pages in sequence
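In Hive, my log data contains repeated pages.
I want to separate the sessions and order them, keeping only the first timestamp of each repeated page.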
ID        Page          Timestamp
Orestes   Login         152356
Orestes   Login         152360
Orestes   Account view  152368
Orestes   Account view  152372
Orestes   Transfer      152380
Orestes   Account view  152382
Orestes   Account view  152390
Orestes   Loan          152393
Antigone  Login         152382
Antigone  Transfer      152390
Antigone  Account view  152392
Antigone  Account view  152395
Antigone  Trust         152399
I want to transform it as follows.
ID        Page          Timestamp  Sequence
Orestes   Login         152356     1
Orestes   Account view  152368     2
Orestes   Transfer      152380     3
Orestes   Account view  152382     4
Orestes   Loan          152393     5
Antigone  Login         152382     1
Antigone  Transfer      152390     2
Antigone  Account view  152392     3
Antigone  Trust         152399     4
The table script is:
insert into log values('Orestes','Login',152356);
insert into log values('Orestes','Login',152360);
insert into log values('Orestes','Account view',152368);
insert into log values('Orestes','Account view',152372);
insert into log values('Orestes','Transfer',152380);
insert into log values('Orestes','Account view',152382);
insert into log values('Orestes','Account view',152390);
insert into log values('Orestes','Loan',152393);
insert into log values('Antigone','Login',152382);
insert into log values('Antigone','Transfer',152390);
insert into log values('Antigone','Account view',152392);
insert into log values('Antigone','Account view',152395);
insert into log values('Antigone','Trust',152399);
For this, I tried:
With cte as
(
Select id, page, min(timestamp) timestamp from log group by id, page
)
Select id, page, timestamp, rank() over (partition by id order by timestamp) from cte;
However, in this case one of Orestes's Account view rows is lost.
How can I fix this?
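Grouping by id and page collapses every occurrence of a page for a user, not only the consecutive ones, which is why the second Account view run disappears. Instead, use LAG to get the previous page and filter out rows where the page repeats.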
with log as (
select stack (13,
'Orestes','Login', 152356,
'Orestes','Login', 152360,
'Orestes','Account view', 152368,
'Orestes','Account view', 152372,
'Orestes','Transfer', 152380,
'Orestes','Account view', 152382,
'Orestes','Account view', 152390,
'Orestes','Loan', 152393,
'Antigone','Login', 152382,
'Antigone','Transfer', 152390,
'Antigone','Account view', 152392,
'Antigone','Account view', 152395,
'Antigone','Trust', 152399
) as (ID,Page,Timestamp)
)
select id, page, timestamp, row_number() over(partition by id order by timestamp) sequence
from
(
select id, page, timestamp, lag(page) over(partition by id order by timestamp) prev_page
from log
)s
where (prev_page!=page) or (prev_page is null)
;
Result:
OK
Antigone Login 152382 1
Antigone Transfer 152390 2
Antigone Account view 152392 3
Antigone Trust 152399 4
Orestes Login 152356 1
Orestes Account view 152368 2
Orestes Transfer 152380 3
Orestes Account view 152382 4
Orestes Loan 152393 5
Time taken: 9.359 seconds, Fetched: 9 row(s)
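If you would rather keep the GROUP BY / min(timestamp) idea from the question, one possible variant is to group by each consecutive run of a page instead of by the page itself. This is only a sketch, assuming the same log table (id, page, timestamp) as above; the names changed, run_id, first_ts and the subquery aliases are made up for illustration and are not part of the original query.

```sql
-- Sketch only: same log table as in the question is assumed.
select id, page, first_ts,
       row_number() over (partition by id order by first_ts) as sequence
from (
    -- one row per consecutive run, with its earliest timestamp
    select id, page, run_id, min(timestamp) as first_ts
    from (
        -- running total of page changes acts as a run identifier
        select id, page, timestamp,
               sum(changed) over (partition by id order by timestamp) as run_id
        from (
            -- 0 when the page equals the previous page, 1 otherwise
            -- (the first row per id has no previous page, so it gets 1)
            select id, page, timestamp,
                   case when lag(page) over (partition by id order by timestamp) = page
                        then 0 else 1 end as changed
            from log
        ) flagged
    ) runs
    group by id, page, run_id
) firsts;
```

On the sample data this should give the same rows as the LAG filter. It is more verbose, but the grouped form makes it easy to add other per-run aggregates, such as max(timestamp) or count(*), if you need them later.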