Best method to parse data logged in a serial manner instead of as tabular, json etc?
I have a set of log files that all follow roughly the format of this example (file1.text):
================================================
Running taskId=[updateFieldInTbl]
startTime: 16:03:34,580
------------------------------------------------
INFO:DBExecute: SQL=[ UPDATE tbl set field = value where thing > 0; ]
SQL: UPDATE tbl set field = value where thing > 0
Statement affected [746664] rows.
------------------------------------------------
Finished taskId=[updateFieldInTbl]
endTime: 16:06:30,571
elapsed: 00:02:55,991
failure: false
anyFailure: false
================================================
================================================
Running taskId=[calculateChecksum]
startTime: 16:06:30,571
------------------------------------------------
INFO:DBExecute: SQL=[ update tbl set checksum = MD5(CONCAT_WS('',field, field2, field3)); ]
SQL: update tbl set checksum = MD5(CONCAT_WS('',field, field2, field3));
Statement affected [9608630] rows.
================================================
===== Greater than 5 minutes Review! ==========
================================================
------------------------------------------------
Finished taskId=[calculateChecksum]
endTime: 16:44:04,473
elapsed: 00:37:33,901
failure: false
anyFailure: false
================================================
================================================
Running taskId=[deleteMatchingChecksum]
startTime: 16:44:04,473
------------------------------------------------
INFO:DBExecute: SQL=[ delete tbl from tbl inner join other on tbl.checksum = other.checksum; ]
SQL: delete tbl from tbl inner join other on tbl.checksum = other.checksum;
Statement affected [9276213] rows.
================================================
===== Greater than 5 minutes Review! ==========
================================================
------------------------------------------------
Finished taskId=[deleteMatchingChecksum]
endTime: 17:49:26,817
elapsed: 01:05:22,344
failure: false
anyFailure: false
================================================
================================================
Running taskId=[deletemissinguserDataChecksum]
startTime: 17:49:26,817
------------------------------------------------
INFO:DBExecute: SQL=[ delete from tbl where some_id =0; ]
SQL: delete from tbl where some_id =0;
Statement affected [0] rows.
------------------------------------------------
Finished taskId=[deletemissinguserDataChecksum]
endTime: 17:49:26,847
elapsed: 00:00:00,030
failure: false
anyFailure: false
================================================
I'd like to convert each of them into something like this:
file1 | taskId | startTime | endTime | elapsed | rowsAffected | Info | failure | anyFailure
file1 | updateFieldInTbl | 16:03:34 | 16:06:30 | 00:02:55 | 746664 | SQL=[ UPDATE tbl set field = value where thing > 0; ] | false | false
file1 | calculateChecksum | 16:06:30 | 16:44:04 | 00:37:33 | 9608630 | SQL=[ update tbl set checksum = MD5(CONCAT_WS('',field, field2, field3)); ] | false | false
file1 | deleteMatchingChecksum | 16:44:04 | 17:49:26 | 01:05:22 | 9276213 | SQL=[ delete tbl from tbl inner join other on tbl.checksum = other.checksum; ] | false | false
Ideally I'd just set the system up to log to a database table so the logs would already be in an easy-to-use format, but that isn't an option right now, so I have to parse the existing logs into something similarly useful.
What tools would you recommend? The goal is to build this with a bash script if at all possible. Any guidance on how to structure the parser would be much appreciated.
I'd suggest Awk.
The processing:
awk 'NR==1{
   # strip the ".text" extension to get the file label and print the header
   fn=substr(FILENAME,1,length(FILENAME)-5);
   print fn" | taskId | startTime | endTime | elapsed | rowsAffected | Info | failure | anyFailure"
}
/Running taskId/{ gsub(/^.+=\[|\]$/, ""); taskId=$0 }
/startTime:/{ sub(/,.*/,"",$2); startTime=$2 }
/INFO:/{ sub(/^INFO:DBExecute: /,""); info=$0 }
/ affected/{ gsub(/\[|\]/,"",$3); affected=$3 }
/endTime/{ sub(/,.*/,"",$2); endTime=$2 }
/elapsed/{ sub(/,.*/,"",$2); elapsed=$2 }
/^failure/{ fail=$2 }
/anyFailure/{
   # the anyFailure line closes a task block, so emit the collected record
   printf "%s | %s | %s | %s | %s | %d | %s | %s | %s\n",
          fn, taskId, startTime, endTime, elapsed, affected, info, fail, $2
}' file1.text
Output:
file1 | taskId | startTime | endTime | elapsed | rowsAffected | Info | failure | anyFailure
file1 | updateFieldInTbl | 16:03:34 | 16:06:30 | 00:02:55 | 746664 | SQL=[ UPDATE tbl set field = value where thing > 0; ] | false | false
file1 | calculateChecksum | 16:06:30 | 16:44:04 | 00:37:33 | 9608630 | SQL=[ update tbl set checksum = MD5(CONCAT_WS('',field, field2, field3)); ] | false | false
file1 | deleteMatchingChecksum | 16:44:04 | 17:49:26 | 01:05:22 | 9276213 | SQL=[ delete tbl from tbl inner join other on tbl.checksum = other.checksum; ] | false | false
file1 | deletemissinguserDataChecksum | 17:49:26 | 17:49:26 | 00:00:00 | 0 | SQL=[ delete from tbl where some_id =0; ] | false | false
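Not part of the original answer, but if you have a whole directory of these logs, a minimal bash wrapper can run the same program once per file. This sketch assumes the awk body above has been saved to a file named parse_tasks.awk (a hypothetical name) and that the logs all end in .text:

#!/usr/bin/env bash
# Sketch: parse every *.text log in the current directory with the awk
# program above (saved, by assumption, as parse_tasks.awk) and collect
# all of the pipe-delimited records into one file.
for f in *.text; do
    awk -f parse_tasks.awk "$f"
done > all_tasks.txt

Each invocation prints its own header line, so filter those out if you only want a single header in the combined file.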
FWIW I'd avoid hard-coding specific field names as much as possible. There's no need to test for every value, since most of the input lines follow the same format; just pick out the few lines that DON'T follow the common format:
$ cat tst.awk
BEGIN { OFS="," }
!NF || /^([^[:alpha:]]|SQL|Finished)/ { next }
{ tag = val = $0 }
/^Running/ {
prt()
gsub(/^[^ ]+ |=.*/,"",tag)
gsub(/.*\[|\].*/,"",val)
}
/^Statement/ {
tag = "rowsAffected"
gsub(/.*\[|\].*/,"",val)
}
/^[:[:alpha:]]+: / {
sub(/:.*/,"",tag)
sub(/^[:[:alpha:]]+: /,"",val)
}
{
tags[++numTags] = tag
tag2val[tag] = val
}
END { prt() }
function prt( tag,val,tagNr) {
if (numTags > 0) {
if ( ++recNr == 1 ) {
printf "\"%s\"%s", "file", OFS
for (tagNr=1; tagNr<=numTags; tagNr++) {
tag = tags[tagNr]
printf "\"%s\"%s", tag, (tagNr<numTags ? OFS : ORS)
}
}
printf "\"%s\"%s", FILENAME, OFS
for (tagNr=1; tagNr<=numTags; tagNr++) {
tag = tags[tagNr]
val = tag2val[tag]
gsub(/"/,"\"\"",val)
printf "\"%s\"%s", val, (tagNr<numTags ? OFS : ORS)
}
}
delete tags
delete tag2val
numTags = 0
}
I'd also have it output CSV so you can read it into Excel or do whatever else you like with it:
$ awk -f tst.awk file1
"file","taskId","startTime","INFO","rowsAffected","endTime","elapsed","failure","anyFailure"
"file1","updateFieldInTbl","16:03:34,580","SQL=[ UPDATE tbl set field = value where thing > 0; ]","746664","16:06:30,571","00:02:55,991","false","false"
"file1","calculateChecksum","16:06:30,571","SQL=[ update tbl set checksum = MD5(CONCAT_WS('',field, field2, field3)); ]","9608630","16:44:04,473","00:37:33,901","false","false"
"file1","deleteMatchingChecksum","16:44:04,473","SQL=[ delete tbl from tbl inner join other on tbl.checksum = other.checksum; ]","9276213","17:49:26,817","01:05:22,344","false","false"
"file1","deletemissinguserDataChecksum","17:49:26,817","SQL=[ delete from tbl where some_id =0; ]","0","17:49:26,847","00:00:00,030","false","false"
If you really care about the order, you can tweak it to print the field values by specific tags rather than in the numeric order they were collected.
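As a minimal sketch of that tweak (not from the original answer; the column names are assumptions taken from the tags that appear in this sample log), the per-line rules stay the same and prt() walks a fixed column list instead of the order the tags were seen in:

# tst_ordered.awk -- sketch of tst.awk with a fixed column order.
# The tags[] array is no longer needed; a simple flag records whether
# any tags were collected for the current task block.
BEGIN {
    OFS = ","
    numCols = split("taskId startTime endTime elapsed rowsAffected INFO failure anyFailure", cols, " ")
}
!NF || /^([^[:alpha:]]|SQL|Finished)/ { next }
{ tag = val = $0 }
/^Running/ {
    prt()
    gsub(/^[^ ]+ |=.*/,"",tag)
    gsub(/.*\[|\].*/,"",val)
}
/^Statement/ {
    tag = "rowsAffected"
    gsub(/.*\[|\].*/,"",val)
}
/^[:[:alpha:]]+: / {
    sub(/:.*/,"",tag)
    sub(/^[:[:alpha:]]+: /,"",val)
}
{ tag2val[tag] = val; gotTags = 1 }
END { prt() }
function prt(   colNr,val) {
    if (gotTags) {
        if ( ++recNr == 1 ) {
            # header row, in the fixed order defined in BEGIN
            printf "\"%s\"%s", "file", OFS
            for (colNr=1; colNr<=numCols; colNr++)
                printf "\"%s\"%s", cols[colNr], (colNr<numCols ? OFS : ORS)
        }
        printf "\"%s\"%s", FILENAME, OFS
        for (colNr=1; colNr<=numCols; colNr++) {
            val = tag2val[cols[colNr]]
            gsub(/"/,"\"\"",val)
            printf "\"%s\"%s", val, (colNr<numCols ? OFS : ORS)
        }
    }
    delete tag2val
    gotTags = 0
}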