Calculating Rolling Weekly Spend in Hive using Window Functions
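I need to build a distribution of customers' week-long spend. Each time a customer makes a purchase, I want to know how much they have spent with us over the past week. I'd like to do this in Hive.
My dataset looks something like this: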
Spend_Table
Cust_ID | Purch_Date | Purch_Amount
1 | 1/1/19 |
1 | 1/2/19 |
1 | 1/3/19 |
1 | 1/4/19 |
1 | 1/5/19 |
1 | 1/6/19 |
1 | 1/7/19 |
2 | 1/1/19 |
2 | 1/2/19 |
2 | 1/3/19 |
2 | 1/5/19 |
2 | 1/7/19 |
2 | 1/9/19 |
2 | 1/11/19 |
So far I have tried code like this:
Select Cust_ID,
Purch_Date,
Purch_Amount,
sum(Purch_Amount) over (partition by Cust_ID order by unix_timestamp(Purch_Date) range between 604800 preceding and current row) as Rolling_Spend
from Spend_Table
Cust_ID | Purch_Date | Purch_Amount | Rolling_Spend
1 | 1/1/19 | |
1 | 1/2/19 | |
1 | 1/3/19 | |
1 | 1/4/19 | |
1 | 1/5/19 | |
1 | 1/6/19 | | 4
1 | 1/7/19 | | 5
2 | 1/1/19 | |
2 | 1/2/19 | |
2 | 1/3/19 | |
2 | 1/5/19 | | 8
2 | 1/7/19 | | 0
2 | 1/9/19 | | 8
2 | 1/11/19 | | 8
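I think the problem is in my RANGE BETWEEN clause, because it seems to be picking up a number of preceding rows. I expected it to pick up data within the preceding number of seconds (604800 is 7 days in seconds).
Is what I'm trying to do feasible? I can't simply use the previous 6 rows, because not every customer makes a purchase every day, as with customer 2. Any help is much appreciated!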
SELECT *, sum(Purch_Amount) OVER (
    PARTITION BY Cust_ID
    ORDER BY CAST(Purch_Date AS timestamp)
    RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW
) AS cumulativeSum FROM Spend_Table
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
Moving the answer out of the question to here:
I was able to get the original code to work by changing it to:
Select Cust_ID,
Purch_Date,
Purch_Amount,
sum(Purch_Amount) over (partition by Cust_ID order by unix_timestamp(Purch_Date, 'MM-dd-yyyy') range between 604800 preceding and current row) as Rolling_Spend
from Spend_Table
The key was specifying the date format in the unix_timestamp call.
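A quick way to see why the format matters, as a rough sketch rather than a definitive diagnosis: the one-argument form of unix_timestamp expects strings laid out as 'yyyy-MM-dd HH:mm:ss', so a value in another layout should fail to parse and leave the window without a usable numeric ordering key. The pattern 'M/d/yy' below is an assumption based on the sample rows in the question (for example 1/1/19); the real column may use a different layout, such as the 'MM-dd-yyyy' used above.

```sql
-- Compare the default parse with an explicit pattern, row by row.
-- 'M/d/yy' is an assumed pattern that matches the sample data (e.g. 1/1/19);
-- substitute whatever layout the real Purch_Date column uses.
SELECT Purch_Date,
       unix_timestamp(Purch_Date)           AS ts_default,   -- expects 'yyyy-MM-dd HH:mm:ss'
       unix_timestamp(Purch_Date, 'M/d/yy') AS ts_explicit,  -- seconds since the epoch
       7 * 24 * 60 * 60                     AS week_seconds  -- 604800, the RANGE offset
FROM Spend_Table
LIMIT 10;
```

If ts_default comes back empty while ts_explicit holds increasing integers, the original query was probably ordering by a key that never parsed, which would explain why the RANGE frame did not behave like a 7-day window.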