Find a string and extract values from result of hive query using shell script?
The question is similar to:
Find and Extract value after specific String from a file using bash shell script?
I am executing a Hive query from a shell script and need to extract some values into a variable. The query is:
sql="show create table dev.emp"
partition_col=`beeline -u $Beeline_URL -e $sql | grep 'PARTITIONED BY' | cut -d "'" -f2`
The output of the sql query is as below:
+----------------------------------------------------+
| createtab_stmt |
+----------------------------------------------------+
| CREATE EXTERNAL TABLE `dv.par_kst`( |
| `col1` string, |
| `col2` string, |
| `col3` string) |
| PARTITIONED BY ( |
| `part_col1` int, |
| `part_col2` int) |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' |
| LOCATION |
| 'hdfs://nameservicets1/dv/hdfsdata/par_kst' |
| TBLPROPERTIES ( |
| 'spark.sql.create.version'='2.2 or prior', |
| 'spark.sql.sources.schema.numPartCols'='2', |
| 'spark.sql.sources.schema.numParts'='1', |
| 'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"col1","type":"string","nullable":true,"metadata":{}},{"name":"col2","type":"string","nullable":true,"metadata":{}},{"name":"col3","type":"integer","nullable":true,"metadata":{}},{"name":"part_col2","type":"integer","nullable":true,"metadata":{}}]}', |
| 'spark.sql.sources.schema.partCol.0'='part_col1', |
| 'spark.sql.sources.schema.partCol.1'='part_col2', |
| 'transient_lastDdlTime'='1587487456') |
+----------------------------------------------------+
From the above sql output, I want to extract the PARTITIONED BY details.
Desired output:
part_col1 , part_col2
Tried the below code but did not get the proper values:
partition_col=`beeline -u $Beeline_URL -e $sql | grep 'PARTITIONED BY' | cut -d "'" -f2`
And the PARTITIONED BY columns are not fixed, meaning there might be 3 or more of them, so I want to extract all of them: every value between PARTITIONED BY and ROW FORMAT SERDE, with the spaces, the "`" characters and the data types removed!
You can use awk:
/PARTITIONED BY \(/ {partitioned_by = 1; next}
/ROW FORMAT SERDE/ {partitioned_by = 0; next}
partitioned_by == 1 {a[n++] = substr($2, 2, length($2) - 2)}
END { for (i in a) printf "%s, ", a[i] }
Store the above in a file named beeline.awk and execute:
partition_col=`beeline -u $Beeline_URL -e "$sql" | awk -f beeline.awk`
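As a quick sanity check, the awk logic can be run against a few lines of the sample output above; the printf lines below just mimic the beeline box formatting, and the END loop is written as an indexed loop (instead of `for (i in a)`) so the output order is deterministic:

```shell
# Extract partition column names from a snippet of the sample `show create table` output.
partition_col=$(printf '%s\n' \
  '| PARTITIONED BY (                                   |' \
  '|   `part_col1` int,                                 |' \
  '|   `part_col2` int)                                 |' \
  '| ROW FORMAT SERDE                                   |' |
awk '
/PARTITIONED BY \(/ {partitioned_by = 1; next}   # start collecting after this line
/ROW FORMAT SERDE/  {partitioned_by = 0; next}   # stop collecting at this line
partitioned_by == 1 {a[n++] = substr($2, 2, length($2) - 2)}  # $2 is the backtick-quoted name
END { for (i = 0; i < n; i++) printf "%s, ", a[i] }
')
echo "$partition_col"   # prints: part_col1, part_col2,  (note the trailing separator)
```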
Using sed:
sed -n '/PARTITIONED BY/,/ROW FORMAT SERD/p' file.txt | sed '1d; $d' | sed -E 's/.*(`.*`).*/\1/g' | tr -d '`' | tr '\n' ','
Demo:
$sed -n '/PARTITIONED BY/,/ROW FORMAT SERD/p' file.txt | sed '1d; $d' | sed -E 's/.*(`.*`).*/\1/g' | tr -d '`' | tr '\n' ','
part_col1,part_col2,$
$
Explanation:
sed -n '/PARTITIONED BY/,/ROW FORMAT SERD/p' <-- print the lines between the 2 patterns
sed '1d; $d' <-- delete the first and last line
sed -E 's/.*(`.*`).*/\1/g' <-- keep only the string between the ` characters
tr -d '`' <-- delete the ` characters
tr '\n' ',' <-- replace newlines with ,
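Both pipelines leave a trailing separator behind (visible as `part_col1,part_col2,` in the demo); if that is unwanted, one extra sed stage, a small addition not part of the answers above, strips it:

```shell
# Remove the trailing comma left by `tr '\n' ','`.
echo 'part_col1,part_col2,' | sed 's/,$//'
# prints: part_col1,part_col2
```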