Calculating Rolling Weekly Spend in Hive using Window Functions

Question

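I need to build a distribution of customers' week-long spend. Each time a customer makes a purchase, I want to know how much they have spent with us over the past week. I'd like to do this in Hive.

My data set looks something like this: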

Spend_Table

Cust_ID | Purch_Date | Purch_Amount  
1 | 1/1/19 |   
1 | 1/2/19 |   
1 | 1/3/19 |   
1 | 1/4/19 |   
1 | 1/5/19 |   
1 | 1/6/19 |   
1 | 1/7/19 |   
2 | 1/1/19 |   
2 | 1/2/19 |   
2 | 1/3/19 |   
2 | 1/5/19 |   
2 | 1/7/19 |   
2 | 1/9/19 |   
2 | 1/11/19 |

So far I have tried code like this:

SELECT Cust_ID,
       Purch_Date,
       Purch_Amount,
       SUM(Purch_Amount) OVER (
           PARTITION BY Cust_ID
           ORDER BY unix_timestamp(Purch_Date)
           RANGE BETWEEN 604800 PRECEDING AND CURRENT ROW
       ) AS Rolling_Spend
FROM Spend_Table



Cust_ID | Purch_Date | Purch_Amount | Rolling_Spend  
1 | 1/1/19 |  |   
1 | 1/2/19 |  |   
1 | 1/3/19 |  |   
1 | 1/4/19 |  |   
1 | 1/5/19 |  |   
1 | 1/6/19 |  | 4  
1 | 1/7/19 |  | 5  
2 | 1/1/19 |  |   
2 | 1/2/19 |  |   
2 | 1/3/19 |  |   
2 | 1/5/19 |  | 8  
2 | 1/7/19 |  | 0  
2 | 1/9/19 |  | 8  
2 | 1/11/19 |  | 8

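I think the problem is in my RANGE BETWEEN clause, because it seems to be picking up a number of preceding rows. I expected it to pick up the data within the preceding number of seconds (604800 is the number of seconds in 7 days).

Is what I'm trying to do possible? I can't simply use the 6 preceding rows, because not every customer makes a purchase every day, as with customer 2. Any help is much appreciated!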

Answer 1

SELECT *,
       sum(Purch_Amount) OVER (
           PARTITION BY Cust_ID
           ORDER BY CAST(Purch_Date AS timestamp)
           RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW
       ) AS cumulativeSum
FROM Spend_Table

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
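
To make the difference between a value-based RANGE frame and a row-based ROWS frame concrete, here is a small Python sketch that emulates both outside of Hive. It is only an illustration, not Hive code: the function names are made up, and the amounts (1 per purchase) are an assumption, since the sample table in the question does not show them. The dates are customer 2's purchase dates from the question.

```python
from datetime import date, timedelta

# Customer 2's purchase dates from the question.
# The amounts are hypothetical: every purchase is assumed to be 1 unit.
PURCHASES = [
    (2, date(2019, 1, 1), 1),
    (2, date(2019, 1, 2), 1),
    (2, date(2019, 1, 3), 1),
    (2, date(2019, 1, 5), 1),
    (2, date(2019, 1, 7), 1),
    (2, date(2019, 1, 9), 1),
    (2, date(2019, 1, 11), 1),
]


def range_frame_sums(rows, days=7):
    """Emulate RANGE BETWEEN <days> PRECEDING AND CURRENT ROW.

    The frame is defined by the ordering value: every row of the same
    customer whose date lies in [current date - days, current date].
    """
    sums = []
    for cust, day, _ in rows:
        lower = day - timedelta(days=days)
        sums.append(
            sum(amt for c, d, amt in rows if c == cust and lower <= d <= day)
        )
    return sums


def rows_frame_sums(rows, preceding=6):
    """Emulate ROWS BETWEEN <preceding> PRECEDING AND CURRENT ROW.

    The frame is defined by position: the current row plus the previous
    `preceding` rows of the same customer, whatever their dates are.
    Assumes `rows` is already sorted by customer and date.
    """
    sums = []
    for i, (cust, _, _) in enumerate(rows):
        partition = [amt for c, _, amt in rows[: i + 1] if c == cust]
        sums.append(sum(partition[-(preceding + 1):]))
    return sums


if __name__ == "__main__":
    print(range_frame_sums(PURCHASES))  # value-based frame
    print(rows_frame_sums(PURCHASES))   # position-based frame
```

For a customer who skips days, the two frames diverge: the row-based frame keeps accumulating older purchases, while the value-based frame drops anything older than the window. That is why a RANGE frame over a real timestamp or date is the right tool here.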

Answer 2

Moving the answer from the question to here:

I was able to get the original code to work by changing it to:
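SELECT Cust_ID,
       Purch_Date,
       Purch_Amount,
       SUM(Purch_Amount) OVER (
           PARTITION BY Cust_ID
           ORDER BY unix_timestamp(Purch_Date, 'MM-dd-yyyy')
           RANGE BETWEEN 604800 PRECEDING AND CURRENT ROW
       ) AS Rolling_Spend
FROM Spend_Table

The key was specifying the date format in the unix_timestamp function. Without a pattern, unix_timestamp expects strings in yyyy-MM-dd HH:mm:ss format, so the ORDER BY value was never a usable timestamp.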
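
As a rough illustration of why the pattern matters, the Python sketch below parses a date string with and without a matching format and then checks whether two purchases fall within 604800 seconds of each other. This is not Hive code. The helper names `to_epoch` and `in_window` are made up for the example, the sample strings such as '01-09-2019' are assumptions chosen to match the 'MM-dd-yyyy' pattern, and '%m-%d-%Y' is the Python counterpart of that pattern.

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 24 * 60 * 60
WEEK_SECONDS = 7 * SECONDS_PER_DAY  # 604800, the offset used in the query


def to_epoch(text, fmt):
    """Parse `text` with an explicit pattern and return UTC epoch seconds.

    Returns None when the text does not match the pattern, which mimics a
    timestamp conversion that fails instead of raising an error.
    """
    try:
        parsed = datetime.strptime(text, fmt)
    except ValueError:
        return None
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())


def in_window(current, earlier, fmt, window=WEEK_SECONDS):
    """True if `earlier` lies within `window` seconds before `current`."""
    cur, old = to_epoch(current, fmt), to_epoch(earlier, fmt)
    if cur is None or old is None:
        return False  # no usable ordering value, so no meaningful frame
    return 0 <= cur - old <= window


if __name__ == "__main__":
    # Wrong pattern: the string is not in 'yyyy-MM-dd HH:mm:ss' form.
    print(to_epoch("01-09-2019", "%Y-%m-%d %H:%M:%S"))
    # Matching pattern: a real epoch value comes back.
    print(to_epoch("01-09-2019", "%m-%d-%Y"))
    # Jan 2 is exactly 7 days before Jan 9, so it is inside the frame;
    # Jan 1 is 8 days before and falls outside.
    print(in_window("01-09-2019", "01-02-2019", "%m-%d-%Y"))
    print(in_window("01-09-2019", "01-01-2019", "%m-%d-%Y"))
```

Note that with day-granular dates a 604800-second offset with an inclusive lower bound covers the current day plus the seven days before it. If the intent is seven calendar days including the purchase day, an offset of 6 days (518400 seconds) would be the closer match.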

hadoop

hive

window-functions

partition