Bash Text Parsing a HIVE DDL
I have a set of repeating blocks of text like the ones below. They are essentially the extended describe output for two Hadoop Hive tables, tablea1 and tablea2, showing their properties.
Detailed Table Information Table(tableName:tablea1, dbName:default, owner:eedc_hdp_s_d-itm-e, createTime:1519807981, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:col1, type:int, comment:null), FieldSchema(name:col2, type:int, comment:null)], location:hdfs://DBDP-Dev/apps/hive/warehouse/tablea1, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{totalSize=0, rawDataSize=0, numRows=0, COLUMN_STATS_ACCURATE={"BASIC_STATS":"true"}, numFiles=0, transient_lastDdlTime=1519807981}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE)
Detailed Table Information Table(tableName:tablea2, dbName:default, owner:eedc_hdp_s_d-itm-e, createTime:1519807982, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:col3, type:int, comment:null), FieldSchema(name:col4, type:int, comment:null)], location:hdfs://DBDP-Dev/apps/hive/warehouse/tablea2, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{totalSize=0, rawDataSize=0, numRows=0, COLUMN_STATS_ACCURATE={"BASIC_STATS":"true"}, numFiles=0, transient_lastDdlTime=1519807982}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE)
Time taken: 0.08 seconds, Fetched: 4 row(s)
From the data above I am trying to produce a tablename|columnname listing, like this:
tablea1|col1
tablea1|col2
tablea2|col3
tablea2|col4
I was able to come up with two commands that each produce one of the columns.
grep -o 'Table(tableName:[^,]*' sample_file | awk -F ':' '{ print $2 }'
gives the first column:
tablea1
tablea2
grep -o 'FieldSchema(name:[^,]*' sample_file | awk -F ':' '{ print $2 }' | uniq
gives the second column:
col1
col2
col3
col4
But I am unable to go any further and combine them into the desired output:
tablea1|col1
tablea1|col2
tablea2|col3
tablea2|col4
Can you help? Or is there a simpler way to do this?
Use grep's -n option together with -o so that each match carries the line number it came from. Then use the join command, joining on that line number as the key, to get the desired output. See the join(1) manual for the options used here.
grep -on 'Table(tableName:[^,]*' sample_file | awk -F ':' '{ OFS="|"; print $1,$3 }' >file1
grep -on 'FieldSchema(name:[^,]*' sample_file | awk -F ':' '{ OFS="|"; print $1,$3 }' >file2
join -t "|" -1 1 -2 1 -o '1.2,2.2' <(sort file1) <(sort file2)
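To make the mechanics concrete, here is the same pipeline run end to end against a miniature sample_file (two shortened DDL lines of the same shape as the real describe output; this is only a sketch of the approach above, with sort done in place for portability):

```shell
# Miniature stand-in for the real "describe extended" output
cat > sample_file <<'EOF'
Detailed Table Information Table(tableName:tablea1, sd:StorageDescriptor(cols:[FieldSchema(name:col1, type:int, comment:null), FieldSchema(name:col2, type:int, comment:null)]))
Detailed Table Information Table(tableName:tablea2, sd:StorageDescriptor(cols:[FieldSchema(name:col3, type:int, comment:null), FieldSchema(name:col4, type:int, comment:null)]))
EOF

# file1 pairs each table name with its source line number: 1|tablea1, 2|tablea2
grep -on 'Table(tableName:[^,]*' sample_file | awk -F ':' '{ OFS="|"; print $1,$3 }' > file1
# file2 does the same for every column: 1|col1, 1|col2, 2|col3, 2|col4
grep -on 'FieldSchema(name:[^,]*' sample_file | awk -F ':' '{ OFS="|"; print $1,$3 }' > file2

# Sort both files in place, then join on the line number (field 1)
# and output table (1.2) followed by column (2.2)
sort -o file1 file1
sort -o file2 file2
join -t '|' -1 1 -2 1 -o '1.2,2.2' file1 file2
```

The line number acts as a synthetic key tying every column back to the table it was matched on.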
We could also write a regexp or awk one-liner to get the desired result, but I feel the above is clearer.
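For reference, one possible shape of such an awk one-liner (a sketch, run here against a miniature sample_file of the same shape; the substr offsets 10 and 17 are the lengths of the "tableName:" and "FieldSchema(name:" prefixes):

```shell
# Miniature stand-in for the real "describe extended" output
cat > sample_file <<'EOF'
Detailed Table Information Table(tableName:tablea1, sd:StorageDescriptor(cols:[FieldSchema(name:col1, type:int, comment:null), FieldSchema(name:col2, type:int, comment:null)]))
Detailed Table Information Table(tableName:tablea2, sd:StorageDescriptor(cols:[FieldSchema(name:col3, type:int, comment:null), FieldSchema(name:col4, type:int, comment:null)]))
EOF

# For each line that has a tableName, grab the name once, then repeatedly
# match-and-strip FieldSchema(name:...) occurrences, printing table|column
awk 'match($0, /tableName:[^,]*/) {
  tbl = substr($0, RSTART + 10, RLENGTH - 10)          # skip "tableName:"
  s = $0
  while (match(s, /FieldSchema\(name:[^,]*/)) {
    print tbl "|" substr(s, RSTART + 17, RLENGTH - 17) # skip "FieldSchema(name:"
    s = substr(s, RSTART + RLENGTH)
  }
}' sample_file
```

Lines without a tableName (such as the trailing "Time taken" line) fail the match and are skipped automatically.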