Hive SQL: Filtering out rows that contain duplicate values for a specific column

Question

I have a Hive table with the following data (the first row is the header):

session,ts,status,color
a,1,new,red
a,2,check,blue
a,3,new,green
a,4,amount,blue
a,5,end,blue
b,1,new,red
b,2,bottle,blue
b,3,end,blue
c,4,check,blue

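I am having trouble writing a SQL query that satisfies these conditions: 1) return all rows for sessions that contain a row with status=new; 2) if a session contains multiple rows with status=new, keep only the first one (the one with the lowest ts).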

The expected output would be:

a,1,new,red
a,2,check,blue
a,4,amount,blue
a,5,end,blue
b,1,new,red
b,2,bottle,blue
b,3,end,blue

The rows a,3,new,green and c,4,check,blue are omitted.

I wrote this query, which does the job if you only look at the session, ts, and status columns, but I don't like the second half of the union because it needs a group by:

select  session, ts, status from mp_logon3
where status!='new'
and session in (select distinct a.session from mp_logon3 a 
where a.status = 'new'
) 
union
select session, min(ts), status from mp_logon3
where status='new'
and session in (select distinct b.session from mp_logon3 b
where b.status = 'new'
)
group by session, status

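However, as soon as I add the color column, it falls apart: because color becomes part of the group by, I get two rows for session=a and status=new, one green and one red.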

select  session, ts, status, color from mp_logon3
where status!='new'
and session in (select distinct a.session from mp_logon3 a 
where a.status = 'new'
) 
union
select session, min(ts), status, color from mp_logon3
where status='new'
and session in (select distinct b.session from mp_logon3 b
where b.status = 'new'
)
group by session, status, color

Finally, is there a better way to write this query as a whole, perhaps without the union?

Answer 1

If you are using Teradata SQL:

-- Keep only the earliest 'new' row per session (QUALIFY is Teradata syntax)
select  session, ts, status, color
from mp_logon3
where status='new'
and session in (select distinct a.session from mp_logon3 a 
where a.status = 'new'
) 
qualify row_number() over (partition by session,status order by ts)=1
union
-- Keep every non-'new' row from sessions that have at least one 'new' row
select  session, ts, status, color from mp_logon3
where status!='new'
and session in (select distinct a.session from mp_logon3 a 
where a.status = 'new'
)

Answer 2

Here is a HiveQL solution to your problem:

WITH sessions
AS (SELECT DISTINCT session
    FROM mp_logon3
    WHERE STATUS = 'new')
,logons
AS (SELECT session
        ,ts
        ,STATUS
        ,color
        ,row_number() OVER (
            PARTITION BY session
            ,STATUS ORDER BY ts
            ) AS r_num
    FROM mp_logon3)
SELECT l.*
FROM logons l
INNER JOIN sessions s ON (s.session = l.session)
WHERE l.STATUS <> 'new'
    OR l.r_num = 1
ORDER BY l.session
    ,l.ts;
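
As a possible variation, the join against the list of sessions could probably be avoided by computing a per-session flag with a second window function. The following is only a sketch under the assumption that the table and columns are exactly as in the question; the names has_new, r_num and the alias t are made up for illustration and do not appear in the original posts.

```sql
-- Sketch only: has_new, r_num and t are illustrative names.
SELECT session, ts, status, color
FROM (
    SELECT session, ts, status, color,
           -- 1 if any row of the session has status = 'new', otherwise 0
           MAX(CASE WHEN status = 'new' THEN 1 ELSE 0 END)
               OVER (PARTITION BY session) AS has_new,
           -- position of the row within its (session, status) group, by ascending ts
           ROW_NUMBER()
               OVER (PARTITION BY session, status ORDER BY ts) AS r_num
    FROM mp_logon3
) t
WHERE has_new = 1                          -- drop sessions without any 'new' row
  AND (status <> 'new' OR r_num = 1)       -- keep only the earliest 'new' row
ORDER BY session, ts;
```

On the sample data this should drop a,3,new,green (its r_num is 2) and every row of session c (its has_new is 0), which matches the expected output in the question. Whether it actually runs faster than the join version depends on the data and the Hive version, so it is worth comparing the two plans.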
