Hive SQL: Filtering out rows that contain duplicate values for a specific column

I have a Hive table with the following data (the first row is the header):

session,ts,status,color
a,1,new,red
a,2,check,blue
a,3,new,green
a,4,amount,blue
a,5,end,blue
b,1,new,red
b,2,bottle,blue
b,3,end,blue
c,4,check,blue

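I am having trouble writing a SQL query that returns:
1) All rows for sessions that contain at least one row with status=new.
2) If a session contains multiple rows with status=new, only the first of them (the one with the lowest ts); the later ones are dropped.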

The output would be:

a,1,new,red
a,2,check,blue
a,4,amount,blue
a,5,end,blue
b,1,new,red
b,2,bottle,blue
b,3,end,blue

The rows a,3,new,green and c,4,check,blue are omitted.

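I wrote the query below. It does the job if you only look at the session, ts and status columns, but I don't like the second half of the union because it needs a GROUP BY: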

select  session, ts, status from mp_logon3
where status!='new'
and session in (select distinct a.session from mp_logon3 a 
where a.status = 'new'
) 
union
select session, min(ts), status from mp_logon3
where status='new'
and session in (select distinct b.session from mp_logon3 b
where b.status = 'new'
)
group by session, status 

However, as soon as the color column is added, it falls apart. (You get two rows for session=a and status=new: one red and one green.)

select  session, ts, status, color from mp_logon3
where status!='new'
and session in (select distinct a.session from mp_logon3 a 
where a.status = 'new'
) 
union
select session, min(ts), status, color from mp_logon3
where status='new'
and session in (select distinct b.session from mp_logon3 b
where b.status = 'new'
)
group by session, status, color

Finally, is there a better way to write this query as a whole? Perhaps without the union?

If you were using Teradata SQL:

select  session, ts, status, color
from mp_logon3
where status='new'
and session in (select distinct a.session from mp_logon3 a 
where a.status = 'new'
) 
qualify row_number() over (partition by session,status order by ts)=1
union
select  session, ts, status, color from mp_logon3
where status!='new'
and session in (select distinct a.session from mp_logon3 a 
where a.status = 'new'
) 

Here is a HiveQL solution to your problem:

WITH sessions
AS (SELECT DISTINCT session
    FROM mp_logon3
    WHERE STATUS = 'new')
,logons
AS (SELECT session
        ,ts
        ,STATUS
        ,color
        ,row_number() OVER (
            PARTITION BY session
            ,STATUS ORDER BY ts
            ) AS r_num
    FROM mp_logon3)
SELECT l.*
FROM logons l
INNER JOIN sessions s ON (s.session = l.session)
WHERE l.STATUS <> 'new'
    OR l.r_num = 1
ORDER BY l.session
    ,l.ts;
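
If you would also like to avoid the join, the same idea can probably be expressed in a single pass over the table by counting the 'new' rows per session with a second window function. This is only a sketch under the assumption that your Hive version supports windowing functions (0.11 or later); the names new_cnt, r_num and the alias t are made up for illustration and do not come from the question.

```sql
-- Sketch: one scan of mp_logon3, no UNION and no JOIN.
-- new_cnt, r_num and t are illustrative names (assumptions).
SELECT session, ts, status, color
FROM (
    SELECT session
        ,ts
        ,status
        ,color
        -- how many rows with status = 'new' the session has (0 means drop the session)
        ,SUM(CASE WHEN status = 'new' THEN 1 ELSE 0 END)
            OVER (PARTITION BY session) AS new_cnt
        -- position of the row among rows with the same status in its session, by ts
        ,ROW_NUMBER() OVER (PARTITION BY session, status ORDER BY ts) AS r_num
    FROM mp_logon3
    ) t
WHERE new_cnt > 0                       -- keep only sessions that contain 'new'
    AND (status <> 'new' OR r_num = 1)  -- of the 'new' rows, keep only the earliest
ORDER BY session
    ,ts;
```

On the sample data this should keep every row of sessions a and b except a,3,new,green, and drop session c entirely, because c has no row with status=new.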