HIVE - compute statistics over partitions with window based on date
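I have seen solutions to problems similar to mine, but none of them quite works for me. I also believe there should be a way to make this work.
Given a table with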
ID | Date | target |
---|---|---|
1 | 2020-01-01 | 1 |
1 | 2020-01-02 | 1 |
1 | 2020-01-03 | 0 |
1 | 2020-01-04 | 1 |
1 | 2020-01-04 | 0 |
1 | 2020-06-01 | 1 |
1 | 2020-06-02 | 1 |
1 | 2020-06-03 | 0 |
1 | 2020-06-04 | 1 |
1 | 2020-06-04 | 0 |
2 | 2020-01-01 | 1 |
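ID is BIGINT, target is INT, and Date is DATE.
For each ID/Date, I want to compute the sum of target and the number of rows for the same ID within the 3 months and the 12 months up to and including that date. Example output: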
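ID | Date | Sum_3 | Count_3 | Sum_12 | Count_12 |
---|---|---|---|---|---|
1 | 2020-01-01 | 1 | 1 | 1 | 1 |
1 | 2020-01-02 | 2 | 2 | 2 | 2 |
1 | 2020-01-03 | 2 | 3 | 2 | 3 |
1 | 2020-01-04 | 3 | 5 | 3 | 5 |
1 | 2020-06-01 | 1 | 1 | 4 | 6 |
1 | 2020-06-02 | 2 | 2 | 5 | 7 |
1 | 2020-06-03 | 2 | 3 | 5 | 8 |
1 | 2020-06-04 | 3 | 5 | 6 | 10 |
2 | 2020-01-01 | 1 | 1 | 1 | 1 |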
How can I get this result in HIVE?
I am not sure whether I should use analytic functions (and if so, how), GROUP BY, or something else.
If you can live with approximating months as a fixed number of days, you can use window functions in Hive:

    select id, date,
        count(*) over(
            partition by id
            order by unix_timestamp(date)
            range 60 * 60 * 24 * 90 preceding   -- 90 days, expressed in seconds
        ) as count_3,
        sum(target) over(
            partition by id
            order by unix_timestamp(date)
            range 60 * 60 * 24 * 90 preceding
        ) as sum_3,
        count(*) over(
            partition by id
            order by unix_timestamp(date)
            range 60 * 60 * 24 * 360 preceding  -- 360 days, expressed in seconds
        ) as count_12,
        sum(target) over(
            partition by id
            order by unix_timestamp(date)
            range 60 * 60 * 24 * 360 preceding
        ) as sum_12
    from mytable
To get a single row per ID and date, you can aggregate in the same query:

    select id, date,
        sum(count(*)) over(
            partition by id
            order by unix_timestamp(date)
            range 60 * 60 * 24 * 90 preceding   -- 90 days, expressed in seconds
        ) as count_3,
        sum(sum(target)) over(
            partition by id
            order by unix_timestamp(date)
            range 60 * 60 * 24 * 90 preceding
        ) as sum_3,
        sum(count(*)) over(
            partition by id
            order by unix_timestamp(date)
            range 60 * 60 * 24 * 360 preceding  -- 360 days, expressed in seconds
        ) as count_12,
        sum(sum(target)) over(
            partition by id
            order by unix_timestamp(date)
            range 60 * 60 * 24 * 360 preceding
        ) as sum_12
    from mytable
    group by id, date, unix_timestamp(date)
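
To make the frame semantics concrete, here is a small Python sketch (plain Python, not Hive) of what a `range N preceding` frame ordered by date computes once rows are grouped per ID and date. It is only an illustration under the day-based approximation used above; the helper name `rolling_stats` and the `ROWS` list are hypothetical, with the rows copied from the question's sample table.

```python
from datetime import date, timedelta

# Sample rows from the question: (id, date, target)
ROWS = [
    (1, date(2020, 1, 1), 1),
    (1, date(2020, 1, 2), 1),
    (1, date(2020, 1, 3), 0),
    (1, date(2020, 1, 4), 1),
    (1, date(2020, 1, 4), 0),
    (1, date(2020, 6, 1), 1),
    (1, date(2020, 6, 2), 1),
    (1, date(2020, 6, 3), 0),
    (1, date(2020, 6, 4), 1),
    (1, date(2020, 6, 4), 0),
    (2, date(2020, 1, 1), 1),
]


def rolling_stats(rows, days):
    """For each distinct (id, date), return (sum, count) over the rows with the
    same id whose date lies in [date - days, date], both bounds inclusive."""
    out = {}
    for key_id, key_date in sorted({(i, d) for i, d, _ in rows}):
        lower = key_date - timedelta(days=days)
        window = [t for i, d, t in rows if i == key_id and lower <= d <= key_date]
        out[(key_id, key_date)] = (sum(window), len(window))
    return out


stats_3 = rolling_stats(ROWS, 90)    # "3 months" approximated as 90 days
stats_12 = rolling_stats(ROWS, 360)  # "12 months" approximated as 360 days

for key in stats_3:
    print(key[0], key[1], *stats_3[key], *stats_12[key])
```

For ID 1 on 2020-06-01, for instance, the 90-day frame only reaches back to 2020-03-03, so it sees a single row, while the 360-day frame also covers the five January rows.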
If you can approximate the intervals (1 month = 30 days), this is an improvement on GMB's answer:

    with t as (
        -- pre-aggregate to one row per ID and Date
        select ID, Date,
            sum(target) target,
            count(target) c_target
        from mytable  -- placeholder for your table name
        group by ID, Date
    )
    select ID, Date,
        sum(target) over(
            partition by ID
            order by unix_timestamp(Date, 'yyyy-MM-dd')
            range 60 * 60 * 24 * 90 preceding   -- 90 days, expressed in seconds
        ) sum_3,
        sum(c_target) over(
            partition by ID
            order by unix_timestamp(Date, 'yyyy-MM-dd')
            range 60 * 60 * 24 * 90 preceding
        ) count_3,
        sum(target) over(
            partition by ID
            order by unix_timestamp(Date, 'yyyy-MM-dd')
            range 60 * 60 * 24 * 360 preceding  -- 360 days, expressed in seconds
        ) sum_12,
        sum(c_target) over(
            partition by ID
            order by unix_timestamp(Date, 'yyyy-MM-dd')
            range 60 * 60 * 24 * 360 preceding
        ) count_12
    from t
Or, if you want exact intervals, you can use a self-join (but it is expensive):

    with t as (
        -- pre-aggregate to one row per ID and Date
        select ID, Date,
            sum(target) target,
            count(target) c_target
        from mytable  -- placeholder for your table name
        group by ID, Date
    )
    select
        t_3month.ID,
        t_3month.Date,
        t_3month.sum_3,
        t_3month.count_3,
        sum(t3.target) sum_12,
        sum(t3.c_target) count_12
    from (
        -- 3-month totals: join each (ID, Date) to its own trailing 3 months
        select
            t1.ID,
            t1.Date,
            sum(t2.target) sum_3,
            sum(t2.c_target) count_3
        from t t1
        left join t t2
            on t2.Date > t1.Date - interval 3 month and
               t2.Date <= t1.Date and
               t1.ID = t2.ID
        group by t1.ID, t1.Date
    ) t_3month
    -- 12-month totals: same join pattern with a wider interval
    left join t t3
        on t3.Date > t_3month.Date - interval 12 month and
           t3.Date <= t_3month.Date and
           t_3month.ID = t3.ID
    group by t_3month.ID, t_3month.Date, t_3month.sum_3, t_3month.count_3
    order by ID, Date;
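
The day-based frames and the exact month-based join do not always select the same rows. The following Python sketch (not Hive) compares the two boundary conditions for a single reference date. The helpers `minus_months`, `in_exact_window` and `in_day_window` are hypothetical names, and clamping the day to the end of a shorter month is an assumption about how the month interval behaves.

```python
import calendar
from datetime import date, timedelta


def minus_months(d, months):
    """Subtract calendar months, clamping the day to the length of the target
    month (assumption: this mirrors how a SQL month interval behaves)."""
    index = d.year * 12 + (d.month - 1) - months
    year, month0 = divmod(index, 12)
    month = month0 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


def in_exact_window(row_date, ref_date, months):
    # Same condition as the join: ref - N months < row_date <= ref
    return minus_months(ref_date, months) < row_date <= ref_date


def in_day_window(row_date, ref_date, days):
    # Same condition as the day-based frame: ref - N days <= row_date <= ref
    return ref_date - timedelta(days=days) <= row_date <= ref_date


ref = date(2020, 6, 1)
probe = date(2020, 3, 2)  # hypothetical row date, not in the sample data

print(minus_months(ref, 3))             # lower bound (exclusive) of the exact window
print(ref - timedelta(days=90))         # lower bound (inclusive) of the 90-day window
print(in_exact_window(probe, ref, 3))   # inside the exact 3-month window?
print(in_day_window(probe, ref, 90))    # inside the 90-day window?
```

With these inputs the exact window starts after 2020-03-01, whereas the 90-day window starts on 2020-03-03, so a row dated 2020-03-02 would be counted by the self-join but missed by the approximation. On the question's sample data the two approaches agree, because no row falls near either boundary.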