How to get min and max partition values from a Hive table using the CLI?

I have several kinds of tables in Hive, with anywhere from 0 to 4 partition columns.

Below is the HDFS layout of a few of these tables, with the number of partition columns ranging from 0 to 4.

-- type-0 <no partitions>
hdfs://ns/user/abc/warehouse/test_db/test_tbl_0/__SNAPPY.gz


-- type-1 <1 partition column in table  = dt>
hdfs://ns/user/abc/warehouse/test_db_a/test_tbl_1/dt=2020-11-14/__SNAPPY.gz
...
hdfs://ns/user/abc/warehouse/test_db_a/test_tbl_1/dt=2020-11-30/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_a/test_tbl_1/dt=2020-12-16/__SNAPPY.gz


-- type-2 <2 partition columns in table = dt, hh>
hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/dt=2020-11-14/hh=01/__SNAPPY.gz
...
hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/dt=2020-11-15/hh=02/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/dt=2020-12-19/hh=03/__SNAPPY.gz


-- type-3 <3 partition columns in table = client, dt, hh>
hdfs://ns/user/abc/warehouse/test_db_c/test_tbl_3/client=cobra/dt=2020-11-14/hh=01/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_c/test_tbl_3/client=cobra/dt=2020-11-29/hh=01/__SNAPPY.gz
...
hdfs://ns/user/abc/warehouse/test_db_c/test_tbl_3/client=cobra/dt=2020-12-20/hh=04/__SNAPPY.gz


-- type-4 <4 partition columns in table = service, geo, dt, hh>
hdfs://ns/user/abc/warehouse/test_db_d/test_tbl_4/service=mobile/geo=us/dt=2020-11-14/hh=01/__SNAPPY.gz   
hdfs://ns/user/abc/warehouse/test_db_d/test_tbl_4/service=mobile/geo=us/dt=2020-11-20/hh=01/__SNAPPY.gz
...
hdfs://ns/user/abc/warehouse/test_db_d/test_tbl_4/service=mobile/geo=us/dt=2020-12-13/hh=21/__SNAPPY.gz

As requested by markp-fuso, here is the expected output for types 0 through 4:

DBName     TableName   MIN_PARTITION(s)                           MAX_PARTITION(s)
test_db    test_tbl_0
test_db_a  test_tbl_1  dt=2020-11-14                              dt=2020-12-16
test_db_b  test_tbl_2  dt=2020-11-14/hh=01                        dt=2020-12-19/hh=03
test_db_c  test_tbl_3  client=cobra/dt=2020-11-14/hh=01           client=cobra/dt=2020-12-20/hh=04
test_db_d  test_tbl_4  service=mobile/geo=us/dt=2020-11-14/hh=01  service=mobile/geo=us/dt=2020-12-13/hh=21

Below is what I have tried for type-2.

## Get the minimum and maximum partition lines,
## dropping hdfs output lines such as 'Found 20 items'
hdfs dfs -ls 'hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/*' | grep -v '^Found' | sort -k6,7 | awk '{print $8}' | (head -n1 && tail -n1)
/*
hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/dt=2019-03-12/hh=00
hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/dt=2021-07-28/hh=22
*/

## Here I try to simplify the output further
hdfs dfs -ls 'hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/*' | grep -v '^Found' | sort -k6,7 | awk '{print $8}' | (head -n1 && tail -n1) | awk -F'/' '{print $(NF-2),$(NF-1),$NF}' | sed ':a;N;$!ba;s/\n/ /g'
/*
test_tbl_2 dt=2019-03-12 hh=00  test_tbl_2 dt=2021-07-28 hh=22
*/

As shown above, I get output in the following format:

TableName  MIN_PARTITION(s)  TableName  MAX_PARTITION(s)   

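Also, I have only tested the approach above on a table with 2 partition columns. Is there a generic bash hack that produces the following format regardless of the number of partition columns?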

DBName  TableName  MIN_PARTITION(s) MAX_PARTITION(s)  

Update: the question has been updated with more sample input plus the matching (desired) output.

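Assumptions:

  • input for a given db/table pair is on consecutive lines, so we can generate output as soon as the input for that pair is exhausted (otherwise we would need to store all data in memory, e.g. in arrays, and print all output once the whole input stream has been consumed)
  • the output format has 4 columns: DBName TableName MinPartition MaxPartition
  • if a db/table pair has only one line of input, the min and max columns will contain the same value
  • using / as the field delimiter, the 'last' field is ignored (__SNAPPY.gz in the sample input)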
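
To make the field numbering behind these assumptions concrete, the sketch below splits one path on / and prints every field with its index. The helper name show_fields is made up for illustration, and it assumes paths with the same depth as the sample data (hdfs://ns/user/abc/warehouse/<db>/<table>/...).

```shell
# show_fields is a hypothetical helper: print each /-separated field of a path
# as "index=value", one per line
show_fields() {
    printf '%s\n' "$1" | awk -F'/' '{ for (i = 1; i <= NF; i++) printf "%d=%s\n", i, $i }'
}

show_fields 'hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/dt=2020-11-14/hh=01/__SNAPPY.gz'
```

With this layout the database name is field 7, the table name is field 8, and the partition directories run from field 9 up to field NF-1; field 2 is empty because of the // after hdfs:.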

Sample input for demonstration purposes:

$ cat hdfs.input
# no min/max for test_db/test_tbl_0

hdfs://ns/user/abc/warehouse/test_db/test_tbl_0/__SNAPPY.gz

hdfs://ns/user/abc/warehouse/test_db_a/test_tbl_1/dt=2020-11-14/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_a/test_tbl_1/dt=2020-11-30/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_a/test_tbl_1/dt=2020-12-16/__SNAPPY.gz

hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/dt=2020-11-14/hh=01/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/dt=2020-11-15/hh=02/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/dt=2020-12-19/hh=03/__SNAPPY.gz

hdfs://ns/user/abc/warehouse/test_db_c/test_tbl_3/client=cobra/dt=2020-11-14/hh=01/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_c/test_tbl_3/client=cobra/dt=2020-11-29/hh=01/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_c/test_tbl_3/client=cobra/dt=2020-12-20/hh=04/__SNAPPY.gz

hdfs://ns/user/abc/warehouse/test_db_d/test_tbl_4/service=mobile/geo=us/dt=2020-11-14/hh=01/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_d/test_tbl_4/service=mobile/geo=us/dt=2020-11-20/hh=01/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_d/test_tbl_4/service=mobile/geo=us/dt=2020-12-13/hh=21/__SNAPPY.gz

hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/dt=2019-03-12/hh=00/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/dt=2021-07-28/hh=22/__SNAPPY.gz

# min=max for test_db_b/test_tbl_7

hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_7/dt=2021-07-28/hh=22/__SNAPPY.gz
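
The listing above is already reduced to one path per line. As a rough sketch of how such a file might be produced, assuming the usual eight-column `hdfs dfs -ls` layout (permissions, replication, owner, group, size, date, time, path) and paths without whitespace, the hypothetical helper ls_to_paths below keeps only file entries and prints their path column:

```shell
# ls_to_paths is a hypothetical helper, not part of the original answer.
# It reads "hdfs dfs -ls" style lines on stdin and prints the path of each file entry.
# Assumes 8 whitespace-separated columns with the path last and no spaces in paths.
ls_to_paths() {
    # file entries start with "-"; directory entries ("d...") and "Found N items" lines are skipped
    awk '$1 ~ /^-/ { print $8 }'
}

# possible usage (the warehouse path is taken from the sample data):
#   hdfs dfs -ls -R 'hdfs://ns/user/abc/warehouse' | ls_to_paths > hdfs.input
```

Partition directories that contain no files would not show up in this output, so tables with empty partitions may need a directory-based variant instead.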

One awk idea:

awk -F'/' '
# print the finished db/table pair, then reset min/max for the next pair
function printline() {
        if ( dbname != "") print dbname, tabname, minpart, maxpart
        minpart = maxpart = ""
}
/^hdfs/ { if ( $7 != dbname || $8 != tabname )   # new db/table pair => flush the previous one
             printline()
          dbname = $7
          tabname = $8
          if ( $10 == "" ) {                     # no partition directories => no min/max
             minpart = maxpart = ""
             next
          }
          pfx = ""
          currpart = ""
          for (i=9; i<NF; i++) {                 # join fields 9..NF-1 with "/"; the file name ($NF) is skipped
              currpart = currpart pfx $i
              pfx=FS
          }
          minpart = ( (minpart == "") || (currpart < minpart) ) ? currpart : minpart
          maxpart = ( (maxpart == "") || (currpart > maxpart) ) ? currpart : maxpart
        }
END     { printline() }
' hdfs.input

This generates:

test_db test_tbl_0
test_db_a test_tbl_1 dt=2020-11-14 dt=2020-12-16
test_db_b test_tbl_2 dt=2020-11-14/hh=01 dt=2020-12-19/hh=03
test_db_c test_tbl_3 client=cobra/dt=2020-11-14/hh=01 client=cobra/dt=2020-12-20/hh=04
test_db_d test_tbl_4 service=mobile/geo=us/dt=2020-11-14/hh=01 service=mobile/geo=us/dt=2020-12-13/hh=21
test_db_b test_tbl_2 dt=2019-03-12/hh=00 dt=2021-07-28/hh=22
test_db_b test_tbl_7 dt=2021-07-28/hh=22 dt=2021-07-28/hh=22
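
Note that test_db_b/test_tbl_2 is reported twice because its input lines are not contiguous. If that first assumption cannot be guaranteed, the in-memory alternative mentioned there could look roughly like the sketch below. The function name minmax_any_order is made up for illustration; it keeps one min and one max per db/table pair in awk arrays and prints everything at the end, in first-seen order. It assumes the same path depth as the sample input (db in field 7, table in field 8).

```shell
# minmax_any_order is a hypothetical name; it reads hdfs paths on stdin, in any order
minmax_any_order() {
    awk -F'/' '
    /^hdfs/ {
        key = $7 OFS $8                       # db name and table name, space-separated
        if ( !(key in seen) ) {               # remember first-seen order for the output
            seen[key] = 1
            order[++n] = key
        }
        if ( NF < 10 ) next                   # no partition directories on this line
        currpart = ""
        for (i = 9; i < NF; i++)              # join fields 9..NF-1 with "/"
            currpart = currpart (i > 9 ? FS : "") $i
        if ( !(key in minpart) || currpart < minpart[key] ) minpart[key] = currpart
        if ( !(key in maxpart) || currpart > maxpart[key] ) maxpart[key] = currpart
    }
    END {
        for (j = 1; j <= n; j++) {
            key = order[j]
            if ( key in minpart ) print key, minpart[key], maxpart[key]
            else                  print key   # unpartitioned table: names only
        }
    }'
}

# possible usage with the sample file from above:
#   minmax_any_order < hdfs.input
```

As in the awk script above, partitions are compared as plain strings, which orders ISO dates and zero-padded hours correctly but would not sort unpadded numbers numerically.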