使用 Hive 分区优化连接性能 table

Optimize the join performance with Hive partition table

我有一个带有一些示例数据的 Hive orc test_dev_db.TransactionUpdateTable table,它将保存需要更新到 main table (test_dev_db.TransactionMainHistoryTable) 的增量数据在 Country,Tran_date.

列上进行分区

Hive 增量加载table 架构:它包含需要合并的 19 行。

CREATE TABLE IF NOT EXISTS test_dev_db.TransactionUpdateTable 
(
Transaction_date timestamp,
Product       string,
Price         int,
Payment_Type  string,
Name          string, 
City          string,
State         string,
Country       string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS orc
;

Hive 主 table 模式:总行数 77。

CREATE TABLE IF NOT EXISTS test_dev_db.TransactionMainHistoryTable
(
Transaction_date timestamp,
Product       string,
Price         int,
Payment_Type  string,
Name          string,
City          string,
State         string
)
PARTITIONED BY (Country string,Tran_date string) 
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS orc
;

我在运行下方查询将增量数据与主table合并。

SELECT
  case when i.transaction_date is not null then cast(substring(current_timestamp(),0,19) as timestamp)  
  else t.transaction_date   end as transaction_date,
  t.product,
  case when i.price is not null then i.price else t.price end as price,
  t.payment_type,
  t.name,
  t.city,
  t.state,
  t.country,
  case when i.transaction_date is not null then substring(current_timestamp(),0,10) 
  else t.tran_date end as tran_date
  from
test_dev_db.TransactionMainHistoryTable t
full join test_dev_db.TransactionUpdateTable i on (t.Name=i.Name)
;
/hdfs/path/database/test_dev_db.db/transactionmainhistorytable/country=Australia/tran_date=2009-03-01
/hdfs/path/database/test_dev_db.db/transactionmainhistorytable/country=Australia/tran_date=2009-05-01

和运行下面查询过滤掉需要合并的特定分区,只是为了消除重写未更新的分区。

SELECT
  case when i.transaction_date is not null then cast(substring(current_timestamp(),0,19) as timestamp)  
  else t.transaction_date   end as transaction_date,
  t.product,
  case when i.price is not null then i.price else t.price end as price,
  t.payment_type,
  t.name,
  t.city,
  t.state,
  t.country,
  case when i.transaction_date is not null then substring(current_timestamp(),0,10) else t.tran_date end as tran_date
  from
(SELECT 
  *
  FROM 
test_dev_db.TransactionMainHistoryTable
where Tran_date in
(select distinct  from_unixtime(to_unix_timestamp (Transaction_date,'yyyy-MM-dd HH:mm'),'yyyy-MM-dd') from test_dev_db.TransactionUpdateTable
))t
full join test_dev_db.TransactionUpdateTable i on (t.Name=i.Name)
;

只有 Transaction_date,价格和分区列 tran_date 在这两种情况下都需要更新。两个查询 运行 都很好,尽管横向执行需要更长的时间。

分区 table 的执行计划为:

 Stage: Stage-5
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: transactionmainhistorytable
            filterExpr: tran_date is not null (type: boolean)
            Statistics: Num rows: 77 Data size: 39151 Basic stats: COMPLETE Column stats: COMPLETE
            Map Join Operator
              condition map:
                   Left Semi Join 0 to 1
              keys:
                0 tran_date (type: string)
                1 _col0 (type: string)
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8

我第二个查询有问题吗?我是否需要同时使用分区列以进行更好的修剪。 非常感谢任何帮助或建议。

也许这不是一个完整的答案,但我希望这些想法会有用。

where tran_date IN (select ... )

实际上与

相同
LEFT SEMI JOIN (SELECT ...)

这反映在计划中:

Map Join Operator
              condition map:
                   Left Semi Join 0 to 1
              keys:
                0 tran_date (type: string)
                1 _col0 (type: string) 

并且执行为map-join。首先子查询数据集被 select 编辑,其次它被放置在分布式缓存中,加载到内存中以供 map-join 使用。所有这些步骤:select,加载到内存,map-join 比读取和覆盖所有 table 慢,因为它太小而且 over-partitioned:统计数据显示行数:77数据大小:39151 - 太小,无法按两列进行分区,甚至太小,根本无法进行分区。尝试更大 table 并使用 EXPLAIN EXTENDED 来检查真正被扫描的内容。

此外,替换为:

from_unixtime(to_unix_timestamp (Transaction_date,'yyyy-MM-dd HH:mm'),'yyyy-MM-dd')

substr(Transaction_date,0,10)date(Transaction_date)

substring(current_timestamp,0,10)current_date 只是为了简化代码。

如果您希望在计划中显示分区筛选器,请尝试将分区筛选器替换为分区列表,您可以在单独的会话中 select 并使用 shell 传递列表分区到 where 子句中,请参阅此答案: