Hive self join based on one column

Question

I have a table in Hive whose data comes from an SAP system. The table has the following columns and data:

+======================================================================+
|document_number | year | cost_centre | vendor_account_number | amount | 
+----------------------------------------------------------------------+
|       1        | 2016 |     XZ10    |                       |  123.5 |
+----------------------------------------------------------------------+
|       1        | 2016 |     XZ10    |      1234567890       |  25.96 |
+----------------------------------------------------------------------+
|       1        | 2016 |     XZ10    |                       |  586   |
+----------------------------------------------------------------------+

As shown above, the vendor_account_number column has a value in only one row, and I want to copy that value into all the other rows.

The expected output is:

+======================================================================+
|document_number | year | cost_centre | vendor_account_number | amount | 
+----------------------------------------------------------------------+
|       1        | 2016 |     XZ10    |      1234567890       |  123.5 |
+----------------------------------------------------------------------+
|       1        | 2016 |     XZ10    |      1234567890       |  25.96 |
+----------------------------------------------------------------------+
|       1        | 2016 |     XZ10    |      1234567890       |  586   |
+----------------------------------------------------------------------+

To achieve this, I wrote the following CTE in Hive:

with non_blank_account_no as(
  select document_number, vendor_account_number
  from my_table
  where vendor_account_number != ''
)

and then did a self left outer join as follows:

select 
    a.document_number, a.year, 
    a.cost_centre, a.amount,
    b.vendor_account_number
from my_table a
left outer join non_blank_account_no b on a.document_number = b.document_number
where a.document_number = ' '

But I got duplicated output, as shown below:

+======================================================================+
|document_number | year | cost_centre | vendor_account_number | amount | 
+----------------------------------------------------------------------+
|       1        | 2016 |     XZ10    |      1234567890       |  123.5 |
+----------------------------------------------------------------------+
|       1        | 2016 |     XZ10    |      1234567890       |  25.96 |
+----------------------------------------------------------------------+
|       1        | 2016 |     XZ10    |      1234567890       |  586   |
+----------------------------------------------------------------------+
|       1        | 2016 |     XZ10    |      1234567890       |  123.5 |
+----------------------------------------------------------------------+
|       1        | 2016 |     XZ10    |      1234567890       |  25.96 |
+----------------------------------------------------------------------+
|       1        | 2016 |     XZ10    |      1234567890       |  586   |
+----------------------------------------------------------------------+

Can anyone help me understand what is wrong with my Hive query?

Answer 1

In many use cases, a self join can be replaced with a window function:

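select  document_number
       ,year
       ,cost_centre

        -- blank values become NULL, which max() ignores, so every row in the
        -- partition receives the document's non-blank vendor_account_number
       ,max (case when vendor_account_number <> '' then vendor_account_number end) over 
        (
            partition by    document_number
        )                                       as vendor_account_number

       ,amount

from    my_table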
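
If you would rather keep the join, here is a minimal sketch under an assumption the question does not confirm: that the real table holds more than one non-blank vendor_account_number row per document_number. The doubled output is consistent with that, because a left outer join returns one row for every matching row on the right side. The first query is only a diagnostic to check the assumption. The second collapses the CTE to a single row per document_number before joining. Table and column names are taken from the question; the alias non_blank_rows is made up for illustration.

```sql
-- Diagnostic (assumption check): list documents that have more than one
-- non-blank vendor_account_number row. Any row returned here would
-- multiply the output of the original join.
select document_number,
       count(*) as non_blank_rows      -- hypothetical alias
from my_table
where vendor_account_number <> ''
group by document_number
having count(*) > 1;

-- Join variant: the CTE now returns at most one row per document_number,
-- so each row of my_table can match at most once.
with non_blank_account_no as (
  select document_number,
         max(vendor_account_number) as vendor_account_number
  from my_table
  where vendor_account_number <> ''
  group by document_number
)
select a.document_number, a.year,
       a.cost_centre, a.amount,
       b.vendor_account_number
from my_table a
left outer join non_blank_account_no b
  on a.document_number = b.document_number;
```

Note that max() picks one value arbitrarily if a document really has several different account numbers. If document_number is only unique within a year in your data (again an assumption), add year to the group by and to the join condition, and likewise to the partition by in the window version.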
