Getting set difference with duplicate rows from Hive tables
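I have two Hive tables, Table1 and Table2. Table1 has duplicate rows; Table2 does not. I want to get the rows of Table1 that are missing from Table2, counting duplicates: each row in Table2 should cancel only one matching row in Table1. How can I do this in Hive Query Language?
Example:
Table1 data: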
Col1,Col2
A1,V1
A1,V1
A2,V2
A3,V3
A3,V3
A3,V3
A4,V4
Table2 data:
Col1,Col2
A1,V1
A2,V2
A3,V3
I want to get the following missing rows from Table1:
Col1,Col2
A1,V1
A3,V3
A3,V3
A4,V4
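You can use something like this. It numbers the copies of each row in Table1 and lets the join match only the first copy, so the remaining copies and the rows with no counterpart in Table2 are returned: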
-- Sample data standing in for Table1 (contains duplicates)
with t1 as (
  select 'A1' col1, 'V1' col2 union all
  select 'A1' col1, 'V1' col2 union all
  select 'A2' col1, 'V2' col2 union all
  select 'A3' col1, 'V3' col2 union all
  select 'A3' col1, 'V3' col2 union all
  select 'A3' col1, 'V3' col2 union all
  select 'A4' col1, 'V4' col2
),
-- Sample data standing in for Table2 (no duplicates)
t2 as (
  select 'A1' col1, 'V1' col2 union all
  select 'A2' col1, 'V2' col2 union all
  select 'A3' col1, 'V3' col2
),
-- Number the copies of each (col1, col2) pair: 1, 2, 3, ...
t1_with_rn as (
  select t1.*, row_number() over (partition by t1.col1, t1.col2) rn from t1
)
select
  t1_with_rn.col1, t1_with_rn.col2
from
  t1_with_rn
  -- Only the first copy (rn = 1) may match t2, so each t2 row cancels exactly one t1 row
  left join t2 on (t1_with_rn.col1 = t2.col1 and t1_with_rn.col2 = t2.col2 and t1_with_rn.rn = 1)
where
  -- Keep only the rows that found no partner in t2
  t2.col1 is null and t2.col2 is null
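The query above relies on Table2 having no duplicates. As a rough sketch of how the same idea might extend to the case where Table2 can also contain repeated rows (an assumption that goes beyond the question), both sides can be numbered and the row number added to the join, so the n-th copy in Table2 cancels the n-th copy in Table1. The table and column names come from the question; the CTE names t1_rn and t2_rn are made up for illustration.

```sql
-- Sketch only: assumes Table1(Col1, Col2) and Table2(Col1, Col2) exist as in the question.
-- t1_rn and t2_rn are hypothetical helper names, not part of the original answer.
with t1_rn as (
  -- Number the copies of each (Col1, Col2) pair in Table1
  select Col1, Col2,
         row_number() over (partition by Col1, Col2) rn
  from Table1
),
t2_rn as (
  -- Number the copies of each (Col1, Col2) pair in Table2 the same way
  select Col1, Col2,
         row_number() over (partition by Col1, Col2) rn
  from Table2
)
select
  t1_rn.Col1, t1_rn.Col2
from
  t1_rn
  -- The n-th copy in Table1 is cancelled only by the n-th copy in Table2
  left join t2_rn on (t1_rn.Col1 = t2_rn.Col1
                  and t1_rn.Col2 = t2_rn.Col2
                  and t1_rn.rn   = t2_rn.rn)
where
  -- Keep the Table1 copies that have no counterpart in Table2
  t2_rn.Col1 is null
```

On the sample data from the question this should return the same four rows as the query above (A1,V1 once, A3,V3 twice, A4,V4 once, in no guaranteed order). The difference only shows up when Table2 repeats a row: a second A3,V3 in Table2 would leave just one A3,V3 in the result.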