Hive self join based on one column
I have a table in Hive whose data comes from an SAP system. The table has the following columns and data:
+======================================================================+
|document_number | year | cost_centre | vendor_account_number | amount |
+----------------------------------------------------------------------+
| 1              | 2016 | XZ10        |                       | 123.5  |
+----------------------------------------------------------------------+
| 1              | 2016 | XZ10        | 1234567890            | 25.96  |
+----------------------------------------------------------------------+
| 1              | 2016 | XZ10        |                       | 586    |
+----------------------------------------------------------------------+
As shown above, the vendor_account_number column has a value in only one row, and I want to copy that value into all the other rows.
The expected output is:
+======================================================================+
|document_number | year | cost_centre | vendor_account_number | amount |
+----------------------------------------------------------------------+
| 1              | 2016 | XZ10        | 1234567890            | 123.5  |
+----------------------------------------------------------------------+
| 1              | 2016 | XZ10        | 1234567890            | 25.96  |
+----------------------------------------------------------------------+
| 1              | 2016 | XZ10        | 1234567890            | 586    |
+----------------------------------------------------------------------+
To achieve this, I wrote the following CTE in Hive:
with non_blank_account_no as(
select document_number, vendor_account_number
from my_table
where vendor_account_number != ''
)
and then did a self left outer join as follows:
select
a.document_number, a.year,
a.cost_centre, a.amount,
b.vendor_account_number
from my_table a
left outer join non_blank_account_no b on a.document_number = b.document_number
where a.document_number = ' '
But I get duplicated output, as shown below:
+======================================================================+
|document_number | year | cost_centre | vendor_account_number | amount |
+----------------------------------------------------------------------+
| 1              | 2016 | XZ10        | 1234567890            | 123.5  |
+----------------------------------------------------------------------+
| 1              | 2016 | XZ10        | 1234567890            | 25.96  |
+----------------------------------------------------------------------+
| 1              | 2016 | XZ10        | 1234567890            | 586    |
+----------------------------------------------------------------------+
| 1              | 2016 | XZ10        | 1234567890            | 123.5  |
+----------------------------------------------------------------------+
| 1              | 2016 | XZ10        | 1234567890            | 25.96  |
+----------------------------------------------------------------------+
| 1              | 2016 | XZ10        | 1234567890            | 586    |
+----------------------------------------------------------------------+
Can anyone help me understand what is wrong with my Hive query?
In many use cases, a self join can be replaced with a window function:
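select document_number
,year
,cost_centre
-- the case yields NULL for blank values; max() ignores NULLs, so every row
-- in the partition receives the non-blank vendor_account_number
,max (case when vendor_account_number <> '' then vendor_account_number end) over
(
partition by document_number
) as vendor_account_number
,amount
from my_table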
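If you would rather keep the join, note that the three sample rows alone would not double up. One plausible cause (an assumption, since the full data is not shown) is that the real table holds more than one non-blank vendor_account_number row per document_number, so each row of `a` is repeated once per matching CTE row. A minimal sketch under that assumption collapses the CTE to one row per document before joining:

```sql
-- Sketch only: assumes each document_number has at most one distinct
-- non-blank vendor_account_number; if there are several, max() simply
-- picks the largest one.
with non_blank_account_no as (
    select document_number,
           max(vendor_account_number) as vendor_account_number
    from my_table
    where vendor_account_number <> ''
    group by document_number          -- one row per document, so no fan-out
)
select a.document_number,
       a.year,
       a.cost_centre,
       b.vendor_account_number,
       a.amount
from my_table a
left outer join non_blank_account_no b
    on a.document_number = b.document_number;
```

The `where a.document_number = ...` filter from the question is left out here; add it back to the outer query if you need it. The window-function version above avoids the second pass over my_table, so it is usually the simpler choice.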