Hive SQL: Filtering out rows that contain duplicate values for a specific column
I have a Hive table with the following data (the first line is the header):
session,ts,status,color
a,1,new,red
a,2,check,blue
a,3,new,green
a,4,amount,blue
a,5,end,blue
b,1,new,red
b,2,bottle,blue
b,3,end,blue
c,4,check,blue
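To reproduce the sample data above, a setup along these lines should work. This is only a sketch: the column types (STRING, INT) are assumptions, since the post does not show the table definition, and it assumes a Hive version that supports INSERT ... VALUES.

```sql
-- Hypothetical DDL: the column types are guesses, not taken from the post.
CREATE TABLE mp_logon3 (
    session STRING,
    ts      INT,
    status  STRING,
    color   STRING
);

-- Sample rows copied from the listing above.
INSERT INTO TABLE mp_logon3 VALUES
    ('a', 1, 'new',    'red'),
    ('a', 2, 'check',  'blue'),
    ('a', 3, 'new',    'green'),
    ('a', 4, 'amount', 'blue'),
    ('a', 5, 'end',    'blue'),
    ('b', 1, 'new',    'red'),
    ('b', 2, 'bottle', 'blue'),
    ('b', 3, 'end',    'blue'),
    ('c', 4, 'check',  'blue');
```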
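I'm having trouble writing a SQL query that returns rows meeting the following conditions:
1) All rows for sessions that contain a row with status=new.
2) If a session contains multiple rows with status=new, keep only the first one (the one with the lowest ts).
The output would be: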
a,1,new,red
a,2,check,blue
a,4,amount,blue
a,5,end,blue
b,1,new,red
b,2,bottle,blue
b,3,end,blue
The rows a,3,new,green and c,4,check,blue are omitted.
I wrote this query, which does the trick if you only look at the session, ts, and status columns, but I don't like the second query because it has a group by in it:
select session, ts, status from mp_logon3
where status!='new'
and session in (select distinct a.session from mp_logon3 a
where a.status = 'new'
)
union
select session, min(ts), status from mp_logon3
where status='new'
and session in (select distinct b.session from mp_logon3 b
where b.status = 'new'
)
group by session, status
However, it falls apart as soon as the color column is added. (You get two rows for session=a and status=new: one green and one red.)
select session, ts, status, color from mp_logon3
where status!='new'
and session in (select distinct a.session from mp_logon3 a
where a.status = 'new'
)
union
select session, min(ts), status, color from mp_logon3
where status='new'
and session in (select distinct b.session from mp_logon3 b
where b.status = 'new'
)
group by session, status, color
Finally, is there a better way to write this query as a whole? Perhaps without the union?
If you are using Teradata SQL:
select session, ts, status, color
from mp_logon3
where status='new'
and session in (select distinct a.session from mp_logon3 a
where a.status = 'new'
)
qualify row_number() over (partition by session,status order by ts)=1
union
select session, ts, status, color from mp_logon3
where status!='new'
and session in (select distinct a.session from mp_logon3 a
where a.status = 'new'
)
Here is a HiveQL solution to your problem:
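-- Sessions that have at least one row with status = 'new'
WITH sessions AS (
    SELECT DISTINCT session
    FROM mp_logon3
    WHERE status = 'new'
),
-- Number the rows within each (session, status) group by ascending ts
logons AS (
    SELECT session,
           ts,
           status,
           color,
           row_number() OVER (PARTITION BY session, status ORDER BY ts) AS r_num
    FROM mp_logon3
)
-- Keep every non-'new' row, plus only the earliest 'new' row per session
SELECT l.*
FROM logons l
INNER JOIN sessions s ON (s.session = l.session)
WHERE l.status <> 'new'
   OR l.r_num = 1
ORDER BY l.session,
         l.ts;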
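As a possible variation on the query above, the join against the list of qualifying sessions can be replaced by a second window function that counts the 'new' rows per session. This is a sketch rather than a tested drop-in replacement: the aliases t and new_cnt are made up for illustration, and it assumes a Hive version with windowing support.

```sql
SELECT session, ts, status, color
FROM (
    SELECT session,
           ts,
           status,
           color,
           -- Number of 'new' rows in the whole session (0 if there are none)
           SUM(CASE WHEN status = 'new' THEN 1 ELSE 0 END)
               OVER (PARTITION BY session) AS new_cnt,
           -- Position of each row within its (session, status) group, by ts
           ROW_NUMBER() OVER (PARTITION BY session, status ORDER BY ts) AS r_num
    FROM mp_logon3
) t
WHERE new_cnt > 0                        -- session has at least one 'new' row
  AND (status <> 'new' OR r_num = 1)     -- drop every 'new' row after the first
ORDER BY session, ts;
```

On the sample data this should keep rows 1, 2, 4, and 5 of session a, all three rows of session b, and nothing from session c. Because the outer SELECT lists the columns explicitly, the helper columns do not appear in the result.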