How to get min and max partition values from a Hive table using the CLI?
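I have several kinds of tables in Hive, with anywhere from 0 to 4 partition columns.
Below is the HDFS layout of a few of these tables, covering 0 to 4 partition columns.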
-- type-0 <no partitions>
hdfs://ns/user/abc/warehouse/test_db/test_tbl_0/__SNAPPY.gz
-- type-1 <1 partition column in table = dt>
hdfs://ns/user/abc/warehouse/test_db_a/test_tbl_1/dt=2020-11-14/__SNAPPY.gz
...
hdfs://ns/user/abc/warehouse/test_db_a/test_tbl_1/dt=2020-11-30/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_a/test_tbl_1/dt=2020-12-16/__SNAPPY.gz
-- type-2 <2 partition columns in table = dt, hh>
hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/dt=2020-11-14/hh=01/__SNAPPY.gz
...
hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/dt=2020-11-15/hh=02/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/dt=2020-12-19/hh=03/__SNAPPY.gz
-- type-3 <3 partition columns in table = client, dt, hh>
hdfs://ns/user/abc/warehouse/test_db_c/test_tbl_3/client=cobra/dt=2020-11-14/hh=01/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_c/test_tbl_3/client=cobra/dt=2020-11-29/hh=01/__SNAPPY.gz
...
hdfs://ns/user/abc/warehouse/test_db_c/test_tbl_3/client=cobra/dt=2020-12-20/hh=04/__SNAPPY.gz
-- type-4 <4 partition columns in table = service, geo, dt, hh>
hdfs://ns/user/abc/warehouse/test_db_d/test_tbl_4/service=mobile/geo=us/dt=2020-11-14/hh=01/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_d/test_tbl_4/service=mobile/geo=us/dt=2020-11-20/hh=01/__SNAPPY.gz
...
hdfs://ns/user/abc/warehouse/test_db_d/test_tbl_4/service=mobile/geo=us/dt=2020-12-13/hh=21/__SNAPPY.gz
As requested by markp-fuso, the expected output for types 0 to 4:
DBName TableName MIN_PARTITION(s) MAX_PARTITION(s)
test_db test_tbl_0
test_db_a test_tbl_1 dt=2020-11-14 dt=2020-12-16
test_db_b test_tbl_2 dt=2020-11-14/hh=01 dt=2020-12-19/hh=03
test_db_c test_tbl_3 client=cobra/dt=2020-11-14/hh=01 client=cobra/dt=2020-12-20/hh=04
test_db_d test_tbl_4 service=mobile/geo=us/dt=2020-11-14/hh=01 service=mobile/geo=us/dt=2020-12-13/hh=21
Below is what I tried for type-2.
## Get the minimum and maximum partition lines
### (lines of hdfs output such as 'Found 20 items' are removed first)
hdfs dfs -ls 'hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/*' | grep -v '^Found' | sort -k6,7 | awk '{print $8}' | (head -n1 && tail -n1)
/*
hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/dt=2019-03-12/hh=00
hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/dt=2021-07-28/hh=22
*/
## Here I try to simplify the output further
hdfs dfs -ls 'hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/*' | grep -v '^Found' | sort -k6,7 | awk '{print $8}' | (head -n1 && tail -n1) | awk -F'/' '{print $(NF-2),$(NF-1),$NF}' | sed ':a;N;$!ba;s/\n/ /g'
/*
test_tbl_2 dt=2019-03-12 hh=00 test_tbl_2 dt=2021-07-28 hh=22
*/
As shown above, I get output in the following format.
TableName MIN_PARTITION(s) TableName MAX_PARTITION(s)
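Also, I have only tested the above approach on a table with 2 partition columns. Is there a generic bash hack that produces the following format regardless of the number of partition columns?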
DBName TableName MIN_PARTITION(s) MAX_PARTITION(s)
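Update: the question has since been updated with more sample input and the matching (desired) output.
Assumptions:
- the input lines for a given db/table pair are contiguous, so output can be generated as soon as the input for that pair is exhausted (otherwise all the data would have to be held in memory, e.g. in arrays, and printed only after the whole input stream has been read)
- the output has 4 columns:
DBName TableName MinPartition MaxPartition
- if a db/table pair has only one line of input, the min and max columns contain the same value
- with / as the field delimiter, the last field (__SNAPPY.gz in the sample input) is ignored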
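To make the last assumption concrete, here is a small sketch that prints every /-separated field of one sample path next to its awk field number. show_fields is a hypothetical helper name used only for illustration.

```shell
# Hypothetical helper: split one path on '/' and print each field
# together with its awk field number.
show_fields() {
    printf '%s\n' "$1" |
        awk -F'/' '{ for (i = 1; i <= NF; i++) printf "$%d=[%s]\n", i, $i }'
}

show_fields 'hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/dt=2020-11-14/hh=01/__SNAPPY.gz'
```

For this path the database name lands in $7, the table name in $8, the partition directories in $9 and $10, and the file name in the last field ($11). $2 is empty because of the double slash after hdfs:.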
Sample input for demonstration purposes:
$ cat hdfs.input
# no min/max for test_db/test_tbl_0
hdfs://ns/user/abc/warehouse/test_db/test_tbl_0/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_a/test_tbl_1/dt=2020-11-14/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_a/test_tbl_1/dt=2020-11-30/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_a/test_tbl_1/dt=2020-12-16/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/dt=2020-11-14/hh=01/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/dt=2020-11-15/hh=02/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/dt=2020-12-19/hh=03/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_c/test_tbl_3/client=cobra/dt=2020-11-14/hh=01/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_c/test_tbl_3/client=cobra/dt=2020-11-29/hh=01/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_c/test_tbl_3/client=cobra/dt=2020-12-20/hh=04/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_d/test_tbl_4/service=mobile/geo=us/dt=2020-11-14/hh=01/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_d/test_tbl_4/service=mobile/geo=us/dt=2020-11-20/hh=01/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_d/test_tbl_4/service=mobile/geo=us/dt=2020-12-13/hh=21/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/dt=2019-03-12/hh=00/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/dt=2021-07-28/hh=22/__SNAPPY.gz
# min=max for test_db_b/test_tbl_7
hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_7/dt=2021-07-28/hh=22/__SNAPPY.gz
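In case it helps, input shaped like hdfs.input could probably be produced from a recursive listing. The sketch below assumes the usual hdfs dfs -ls -R output layout (permission string first, path last) and paths without spaces. ls_files_only is a hypothetical helper name, and the commented-out invocation is an assumption about how it would be wired up, not something from the question.

```shell
# Hypothetical helper: read `hdfs dfs -ls -R` output on stdin, keep only
# regular-file entries (permission string starts with '-') and print the path.
# Assumes paths contain no whitespace.
ls_files_only() {
    awk '$1 ~ /^-/ { print $NF }'
}

# Assumed usage against the warehouse root shown in the question:
# hdfs dfs -ls -R 'hdfs://ns/user/abc/warehouse' | ls_files_only > hdfs.input
```

Directory entries (permission string starting with d) and any 'Found N items' lines are dropped, so only file paths remain.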
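One awk idea:
awk -F'/' '
# print the previous db/table pair (if any), then reset min/max
function printline() {
    if ( dbname != "") print dbname, tabname, minpart, maxpart
    minpart = maxpart = ""
}
# with "/" as delimiter: $7 = db name, $8 = table name, $9..$(NF-1) = partitions
/^hdfs/ { if ( $7 != dbname || $8 != tabname )
              printline()
          dbname  = $7
          tabname = $8
          # no partition directories on this line (only the file name follows the table)
          if ( $10 == "" ) {
              minpart = maxpart = ""
              next
          }
          pfx = ""
          currpart = ""
          # rebuild the partition spec, skipping the last field (the file name)
          for (i=9; i<NF; i++) {
              currpart = currpart pfx $i
              pfx=FS
          }
          minpart = ( (minpart == "") || (currpart < minpart) ) ? currpart : minpart
          maxpart = ( (maxpart == "") || (currpart > maxpart) ) ? currpart : maxpart
        }
END     { printline() }
' hdfs.input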
This generates:
test_db test_tbl_0
test_db_a test_tbl_1 dt=2020-11-14 dt=2020-12-16
test_db_b test_tbl_2 dt=2020-11-14/hh=01 dt=2020-12-19/hh=03
test_db_c test_tbl_3 client=cobra/dt=2020-11-14/hh=01 client=cobra/dt=2020-12-20/hh=04
test_db_d test_tbl_4 service=mobile/geo=us/dt=2020-11-14/hh=01 service=mobile/geo=us/dt=2020-12-13/hh=21
test_db_b test_tbl_2 dt=2019-03-12/hh=00 dt=2021-07-28/hh=22
test_db_b test_tbl_7 dt=2021-07-28/hh=22 dt=2021-07-28/hh=22
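Note that test_db_b/test_tbl_2 shows up twice in this output because its lines are not contiguous in hdfs.input. If contiguity cannot be guaranteed, the array-based alternative mentioned in the assumptions might look roughly like the sketch below. minmax_by_table is a hypothetical function name, and the sketch assumes the same field positions as the sample paths ($7 = database, $8 = table).

```shell
# Hypothetical function: read hdfs:// file paths on stdin and print one
# "db table min max" line per db/table pair, even if its lines are scattered.
minmax_by_table() {
    awk -F'/' '
    /^hdfs/ {
        key = $7 OFS $8                    # $7 = database, $8 = table
        if (!(key in seen)) {              # remember first-seen order
            seen[key] = 1
            order[++n] = key
        }
        if (NF < 10) next                  # no partition directories on this line
        part = $9
        for (i = 10; i < NF; i++)          # skip the last field (the file name)
            part = part FS $i
        if (!(key in minpart) || part < minpart[key]) minpart[key] = part
        if (!(key in maxpart) || part > maxpart[key]) maxpart[key] = part
    }
    END {
        for (j = 1; j <= n; j++) {
            key = order[j]
            print key, minpart[key], maxpart[key]
        }
    }'
}
```

Calling minmax_by_table < hdfs.input should then report test_db_b test_tbl_2 once, with dt=2019-03-12/hh=00 as the minimum and dt=2021-07-28/hh=22 as the maximum. The trade-off is that every db/table pair stays in memory until the input ends.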