Find a string and extract values from result of hive query using shell script?

This question is similar to: Find and Extract value after specific String from a file using bash shell script?

I am running a Hive query from a shell script and need to extract some values into a variable. The query is:

sql="show create table dev.emp"
partition_col= `beeline -u $Beeline_URL -e $sql` | grep 'PARTITIONED BY' | cut -d "'" -f2`

The output of the SQL query is as follows:

+----------------------------------------------------+
|                   createtab_stmt                   |
+----------------------------------------------------+
| CREATE EXTERNAL TABLE `dv.par_kst`(                |
|   `col1` string,                                   |
|   `col2` string,                                  |
|   `col3` string)                                  |
| PARTITIONED BY (                                   |
|   `part_col1` int,                                 |
|   `part_col2` int)                                 |
| ROW FORMAT SERDE                                   |
|   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'  |
| STORED AS INPUTFORMAT                              |
|   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'  |
| OUTPUTFORMAT                                       |
|   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' |
| LOCATION                                           |
|   'hdfs://nameservicets1/dv/hdfsdata/par_kst' |
| TBLPROPERTIES (                                    |
|   'spark.sql.create.version'='2.2 or prior',       |
|   'spark.sql.sources.schema.numPartCols'='2',      |
|   'spark.sql.sources.schema.numParts'='1',         |
|   'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"col1","type":"string","nullable":true,"metadata":{}},{"name":"col2","type":"string","nullable":true,"metadata":{}},{"name":"col3","type":"integer","nullable":true,"metadata":{}},{"name":"part_col2","type":"integer","nullable":true,"metadata":{}}]}',  |
|   'spark.sql.sources.schema.partCol.0'='part_col1', |
|   'spark.sql.sources.schema.partCol.1'='part_col2', |
|   'transient_lastDdlTime'='1587487456')            |
+----------------------------------------------------+

From the output above, I want to extract the PARTITIONED BY details.

Desired output :

part_col1 , part_col2

I tried the following code but did not get the correct value:

partition_col=`beeline -u $Beeline_URL -e $sql` | grep 'PARTITIONED BY' | cut -d "'" -f2`

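The PARTITIONED BY columns are not fixed: there may be 3 or more of them, so I want to extract all of the partition columns.

That is, every value between PARTITIONED BY and ROW FORMAT SERDE, with the spaces, the "`" characters, and the data types removed.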

You can use awk:

/PARTITIONED BY \(/  {partitioned_by = 1; next}
/ROW FORMAT SERDE/  {partitioned_by = 0; next}
partitioned_by == 1 {a[n++] = substr($2, 2, length($2) - 2)}
END { for (i = 0; i < n; i++) printf "%s%s", a[i], (i < n - 1 ? ", " : "\n") }

Save the above in a file named beeline.awk and run:

partition_col=$(beeline -u "$Beeline_URL" -e "$sql" | awk -f beeline.awk)
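
To try the awk approach without a Hive connection, here is a minimal sketch that feeds a few rows shaped like the output in the question through an inline awk program. The names `sample_output` and `extract_partition_cols` are made up for illustration, and the sketch assumes each partition row has the form "| `name` type |", so that the back-quoted name is the second whitespace-separated field.

```shell
#!/usr/bin/env bash
# Sample rows copied in shape from the "show create table" output above.
sample_output='| CREATE EXTERNAL TABLE `dv.par_kst`(                |
|   `col1` string,                                   |
|   `col3` string)                                   |
| PARTITIONED BY (                                   |
|   `part_col1` int,                                 |
|   `part_col2` int)                                 |
| ROW FORMAT SERDE                                   |
| LOCATION                                           |'

# Hypothetical helper: reads the table DDL on stdin and prints the
# partition column names as one comma-separated line.
extract_partition_cols() {
  awk '
    /PARTITIONED BY \(/ { in_part = 1; next }  # start collecting on the next line
    /ROW FORMAT SERDE/  { in_part = 0 }        # stop at the following clause
    in_part {
      gsub(/`/, "", $2)                        # $2 is the back-quoted column name
      cols = (cols == "" ? $2 : cols ", " $2)  # join names without a trailing comma
    }
    END { print cols }
  '
}

partition_col=$(printf '%s\n' "$sample_output" | extract_partition_cols)
echo "$partition_col"
```

With a real connection, the same function could presumably sit at the end of the beeline pipeline in place of `awk -f beeline.awk`.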

Using sed:

sed -n '/PARTITIONED BY/,/ROW FORMAT SERD/p' file.txt | sed '1d; $d' | sed -E 's/.*(`.*`).*/\1/g' | tr -d '`' | tr '\n' ','

Demo:

$ sed -n '/PARTITIONED BY/,/ROW FORMAT SERD/p' file.txt | sed '1d; $d' | sed -E 's/.*(`.*`).*/\1/g' | tr -d '`' | tr '\n' ','
part_col1,part_col2,$
$

Explanation:

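sed -n '/PARTITIONED BY/,/ROW FORMAT SERD/p' <-- prints the lines between the two patterns (inclusive)

sed '1d; $d' <-- deletes the first and last lines

sed -E 's/.*(`.*`).*/\1/g' <-- keeps only the back-quoted string on each line

tr -d '`' <-- deletes the ` characters

tr '\n' ',' <-- replaces each newline with a comma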
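
The pipeline above leaves a trailing comma, because the final newline is also turned into ",". One possible variation, sketched below, restricts a single sed substitution to the PARTITIONED BY / ROW FORMAT SERDE range and joins the results with `paste`, which does not add a trailing delimiter. The names `sample_output` and `extract_with_sed` are hypothetical, and the sketch assumes the two boundary lines contain no backticks, as in the output shown in the question.

```shell
#!/usr/bin/env bash
# Sample rows copied in shape from the "show create table" output above.
sample_output='| CREATE EXTERNAL TABLE `dv.par_kst`(                |
|   `col1` string,                                   |
| PARTITIONED BY (                                   |
|   `part_col1` int,                                 |
|   `part_col2` int)                                 |
| ROW FORMAT SERDE                                   |
| LOCATION                                           |'

# Hypothetical helper: reads the table DDL on stdin and prints the
# partition column names joined by commas.
extract_with_sed() {
  # Inside the range, print only lines where a back-quoted name was found;
  # the two boundary lines have no backticks, so they are skipped.
  sed -n '/PARTITIONED BY/,/ROW FORMAT SERD/ s/.*`\(.*\)`.*/\1/p' |
    paste -sd, -   # join all lines into one, separated by commas
}

partition_col=$(printf '%s\n' "$sample_output" | extract_with_sed)
echo "$partition_col"
```

Because the substitution only prints lines that matched, the separate `sed '1d; $d'` and `tr -d` steps are no longer needed in this variant.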