Hive: compute statistics over partitions with a date-based window

I have seen solutions to problems similar to mine, but none of them worked for me. I still believe there should be a way to make this work.

Given a table with the following rows:

ID Date target
1 2020-01-01 1
1 2020-01-02 1
1 2020-01-03 0
1 2020-01-04 1
1 2020-01-04 0
1 2020-06-01 1
1 2020-06-02 1
1 2020-06-03 0
1 2020-06-04 1
1 2020-06-04 0
2 2020-01-01 1

ID is BIGINT, target is INT, and Date is DATE.

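For each ID/Date pair, I want to compute the sum of target and the number of rows for the same ID over the 3 months and the 12 months up to and including that date. Example output: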

ID Date Sum_3 Count_3 Sum_12 Count_12
1 2020-01-01 1 1 1 1
1 2020-01-02 2 2 2 2
1 2020-01-03 2 3 2 3
1 2020-01-04 3 5 3 5
1 2020-06-01 1 1 4 6
1 2020-06-02 2 2 5 7
1 2020-06-03 2 3 5 8
1 2020-06-04 3 5 6 10
2 2020-01-01 1 1 1 1

How can I get this result in Hive? I am not sure whether I should use analytic functions (and how), GROUP BY, or something else.

If you can accept approximating months as a fixed number of days, you can use window functions in Hive:

select id, date,
    count(*) over(
        partition by id
        order by unix_timestamp(date)
        range 7776000 preceding -- 90 days in seconds (60 * 60 * 24 * 90)
    ) as count_3,
    sum(target) over(
        partition by id
        order by unix_timestamp(date)
        range 7776000 preceding
    ) as sum_3,
    count(*) over(
        partition by id
        order by unix_timestamp(date)
        range 31104000 preceding -- 360 days in seconds (60 * 60 * 24 * 360)
    ) as count_12,
    sum(target) over(
        partition by id
        order by unix_timestamp(date)
        range 31104000 preceding
    ) as sum_12
from mytable

You can also aggregate in the same query, so that each id and date pair yields a single row:

select id, date,
    sum(count(*)) over(
        partition by id
        order by unix_timestamp(date)
        range 7776000 preceding -- 90 days in seconds (60 * 60 * 24 * 90)
    ) as count_3,
    sum(sum(target)) over(
        partition by id
        order by unix_timestamp(date)
        range 7776000 preceding
    ) as sum_3,
    sum(count(*)) over(
        partition by id
        order by unix_timestamp(date)
        range 31104000 preceding -- 360 days in seconds (60 * 60 * 24 * 360)
    ) as count_12,
    sum(sum(target)) over(
        partition by id
        order by unix_timestamp(date)
        range 31104000 preceding
    ) as sum_12
from mytable
group by id, date, unix_timestamp(date)
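
How rough is the day-based approximation? The check below is only an illustration, using Hive's built-in date_sub and add_months functions and the first June date from the question as a sample value. It compares where the 90-day and 360-day windows start with where the calendar-month windows would start:

```sql
-- Compare the day-based window start with the calendar-month boundary
-- for one sample date from the question's data.
select
    date_sub('2020-06-01', 90)    as start_90_days,    -- 2020-03-03
    add_months('2020-06-01', -3)  as start_3_months,   -- 2020-03-01
    date_sub('2020-06-01', 360)   as start_360_days,   -- 2019-06-07
    add_months('2020-06-01', -12) as start_12_months;  -- 2019-06-01
```

The two rules differ by a few days, so rows that fall close to the window boundary can be counted under one rule and not under the other. Whether that matters depends on how dense the data is around those boundaries.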

If you can approximate the interval (1 month = 30 days), here is an improvement on GMB's answer:

with t as (
    -- one row per ID and Date, so duplicate dates are collapsed first
    select ID, Date,
        sum(target) target,
        count(target) c_target
    from mytable
    group by ID, Date
)
select ID, Date,
    sum(target) over(
        partition by ID
        order by unix_timestamp(Date, 'yyyy-MM-dd')
        range 7776000 preceding -- 90 days in seconds (60 * 60 * 24 * 90)
    ) sum_3,
    sum(c_target) over(
        partition by ID
        order by unix_timestamp(Date, 'yyyy-MM-dd')
        range 7776000 preceding
    ) count_3,
    sum(target) over(
        partition by ID
        order by unix_timestamp(Date, 'yyyy-MM-dd')
        range 31104000 preceding -- 360 days in seconds (60 * 60 * 24 * 360)
    ) sum_12,
    sum(c_target) over(
        partition by ID
        order by unix_timestamp(Date, 'yyyy-MM-dd')
        range 31104000 preceding
    ) count_12
from t
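
The four window specifications in this approach differ only in their frame size, so they can be factored out. The following is a sketch rather than a tested drop-in replacement: it assumes your Hive version accepts the WINDOW clause with several named windows, it uses mytable as a stand-in table name, and it orders by a day number from datediff instead of seconds, which is a substitution of mine and not part of the answers above. Column names follow the question; depending on the Hive version and configuration, Date may need backtick quoting.

```sql
with t as (
    -- one row per ID and Date
    select ID, Date,
        sum(target)   as target,
        count(target) as c_target
    from mytable  -- stand-in table name
    group by ID, Date
)
select ID, Date,
    sum(target)   over w3  as sum_3,
    sum(c_target) over w3  as count_3,
    sum(target)   over w12 as sum_12,
    sum(c_target) over w12 as count_12
from t
window
    w3 as (
        partition by ID
        order by datediff(Date, '1970-01-01')  -- days since a fixed reference date
        range between 90 preceding and current row
    ),
    w12 as (
        partition by ID
        order by datediff(Date, '1970-01-01')
        range between 360 preceding and current row
    )
```

Ordering by whole days keeps the frame sizes readable (90 and 360 instead of second counts) and should avoid any time-zone or daylight-saving effects in the second-based arithmetic, if those apply to your setup.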

Or, if you want exact intervals, you can use a self-join (but it is expensive):

with t as (
    select ID, Date,
        sum(target) target,
        count(target) c_target
    from mytable
    group by ID, Date
)
select
    t_3month.ID, 
    t_3month.Date, 
    t_3month.sum_3, 
    t_3month.count_3, 
    sum(t3.target) sum_12, 
    sum(t3.c_target) count_12
from (
    select 
        t1.ID, 
        t1.Date,
        sum(t2.target) sum_3,
        sum(t2.c_target) count_3
    from t t1
    left join t t2
    on t2.Date > t1.Date - interval 3 month and
       t2.Date <= t1.Date and
       t1.ID = t2.ID
    group by t1.ID, t1.Date
) t_3month
left join t t3
on t3.Date > t_3month.Date - interval 12 month and
   t3.Date <= t_3month.Date and
   t_3month.ID = t3.ID
group by t_3month.ID, t_3month.Date, t_3month.sum_3, t_3month.count_3
order by ID, Date;
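
Since every row inside the 3-month window also lies inside the 12-month window, the two joins above can probably be collapsed into one. The sketch below is a variation under stated assumptions, not the answer's own method: it joins once over the 12-month range and picks out the 3-month subset with a conditional sum. It uses add_months in place of the interval arithmetic, so end-of-month dates may be handled slightly differently. It keeps only the ID equality in the join condition and moves the date range to the WHERE clause, which should also work on Hive versions that restrict joins to equality conditions. mytable is again a stand-in table name.

```sql
with t as (
    -- one row per ID and Date
    select ID, Date,
        sum(target)   as target,
        count(target) as c_target
    from mytable  -- stand-in table name
    group by ID, Date
)
select
    t1.ID,
    t1.Date,
    -- the 3-month rows are a subset of the joined 12-month rows
    sum(case when t2.Date > cast(add_months(t1.Date, -3) as date)
             then t2.target else 0 end)   as sum_3,
    sum(case when t2.Date > cast(add_months(t1.Date, -3) as date)
             then t2.c_target else 0 end) as count_3,
    sum(t2.target)   as sum_12,
    sum(t2.c_target) as count_12
from t t1
join t t2
    on t1.ID = t2.ID
where t2.Date <= t1.Date
  and t2.Date > cast(add_months(t1.Date, -12) as date)
group by t1.ID, t1.Date
order by ID, Date;
```

An inner join is enough here because each row of t always matches itself, so no ID/Date pair is lost. The join still pairs each row with up to a year of history per ID, so it remains more expensive than the window-function versions.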