Perl optional capture groups not working?

I have the following sample.txt file:

2021-10-07 10:32:05,767 ERROR [LAWT2] blah.blah.blah - Message processing FAILED: <ExecutionReport blah="xxx" foo="yyy" SessionID="kkk" MoreStuff="zz"> Total time for which application threads were stopped: 0.0003858 seconds, Stopping threads took: 0.0000653 seconds
2021-10-07 10:31:32,902 ERROR [LAWT6] blah.blah.blah - Message processing FAILED: <NewOrderSingle SessionID="zkx" TargetSubID="ttt" Account="blah" MsgType="D" BookingTypeOverride="0" Symbol="6316" OtherField1="othervalue1" Otherfield2="othervalue2"/></D></NewOrderSingle>

I just want to grab two key fields, "SessionID" and "MsgType", and print them like so:

SessionID="kkk"|
SessionID="zkx"|MsgType="D"

In other words: if a group does not match, I just want to print a blank in its place.

I tried the following without success:

$ perl -ne '/ (SessionID=".*?")? .*(MsgType=".*?")? / and print "$1|$2\n"' sample.txt
SessionID="kkk"|
SessionID="zkx"|

Can anyone here shed some light? Many thanks.

It's not as simple as it seems:

/ (SessionID=".*?")? .*(MsgType=".*?")? /
                     ~~

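The underlined part matches MsgType when it is present, even if you add a ? to it. The engine takes the leftmost match and tries the longest .* first, so once the rest of the pattern succeeds with the optional group left empty, it does not go back to capture MsgType.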

It is possible with a lookaround assertion, though:

/ (SessionID="[^"]*")? (?:(?!.*?MsgType)|.*? (MsgType=".*?")).* /

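That is, either SessionID is not followed by MsgType, or MsgType is there and we capture it.

I wouldn't recommend putting quantifiers on capture groups. Also, it looks like the log contains XML; how about extracting it and using a parser?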

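You can use

perl -ne '/\h(SessionID="[^"]*")?(?:\h++.*(MsgType="[^"]*"))?\h/ and print "$1|$2\n"'

See the regex demo. Details:

  • \h - a horizontal whitespace character
  • (SessionID="[^"]*")? - Group 1 (optional): SessionID=", any zero or more characters other than ", and then a "
  • (?:\h++.*(MsgType="[^"]*"))? - an optional (but greedy) sequence of
    • \h++ - one or more horizontal whitespace characters (possessive, so no backtracking into them)
    • .* - any zero or more characters other than line break characters, as many as possible
    • (MsgType="[^"]*") - Group 2: MsgType=", any zero or more characters other than ", and then a "
  • \h - a horizontal whitespace character.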

See the online demo:

s='2021-10-07 10:32:05,767 ERROR [LAWT2] blah.blah.blah - Message processing FAILED: <ExecutionReport blah="xxx" foo="yyy" SessionID="kkk" MoreStuff="zz"> Total time for which application threads were stopped: 0.0003858 seconds, Stopping threads took: 0.0000653 seconds
2021-10-07 10:31:32,902 ERROR [LAWT6] blah.blah.blah - Message processing FAILED: <NewOrderSingle SessionID="zkx" TargetSubID="ttt" Account="blah" MsgType="D" BookingTypeOverride="0" Symbol="6316" OtherField1="othervalue1" Otherfield2="othervalue2"/></D></NewOrderSingle>'
perl -ne '/\h(SessionID=".*?")?(?:\h++.*(MsgType=".*?"))?\h/ and print "$1|$2\n"' <<< "$s"

This prints:

SessionID="kkk"|
SessionID="zkx"|MsgType="D"

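Sorry, one thing I did not mention in the question is that I plan to extract several fields and print them in a fixed order, so I ended up writing an awk script.

I am leaving it here in case anyone else wants to use it (I am processing thousands of lines in log files, so a script is a good fit).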

#!/usr/bin/awk -f
# Append the first element of the_array that matches the_field to the_line
function get_field(the_array, the_field, the_line){
  for (key in the_array) {
      if (the_array[key] ~ the_field){
          if (the_line == "")
              the_line = the_array[key]
          else
              the_line = the_line "|" the_array[key]
          break
      }
  }
  return the_line
}
BEGIN{
    the_line = ""
}
{
    # Per record: gather the wanted key="value" fields, skipping repeats via the_keys
    the_line = ""
    delete the_keys
    for(f=1;f<=NF;f++){
        if (($f ~ "^(ClOrdID|Symbol|MsgType|SessionID|OrdStatus)=") && (the_keys[$f] == "")){
            if (the_line == "")
                the_line = $f
            else
                the_line = $f"|"the_line
            the_keys[$f]++
        }
    }
    arr[the_line]++    # identical field combinations collapse into one entry
}
END{
    for(i in arr) {
        if (index(i, "|")) {    # literal test; as a regex, "|" matches every string
            the_line = ""
            split(i,aa,"|")
            # Print the fields in the correct order
            the_line = get_field(aa,"SessionID",the_line)
            the_line = get_field(aa,"ClOrdID",the_line)
            the_line = get_field(aa,"MsgType",the_line)
            the_line = get_field(aa,"OrdStatus",the_line)
            the_line = get_field(aa,"Symbol",the_line)
            print the_line
        } else {
            print(i)
        }
    }
}

To use it:

$ awk -f aa.awk sample.txt
SessionID="kkk"
SessionID="zkx"|MsgType="D"|Symbol="6316"