How to handle bad trailing data in input?

Question

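I'm trying to understand how ANTLR4 handles errors in a Python environment. My final code needs to detect and report any invalid data in a file, no matter where it appears. As part of this, I've been using the examples from py3antlr4book to try out some basic scenarios. Specifically, I used the example in the 01-Hello directory and tried two different input files with a bogus entry added: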

Hello.g4

grammar Hello;            // Define a grammar called Hello
r  : 'hello' ID ;         // match keyword hello followed by an identifier
ID : [a-z]+ ;             // match lower-case identifiers
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines, \r (Windows)

bogus_first.txt

bogus
hello world

Output

line 1:0 extraneous input 'bogus' expecting 'hello'
(r bogus hello world)

bogus_last.txt

hello world
bogus

Output

(r hello world)

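The output for bogus_first.txt makes sense to me: it errors out and points to where the error is. The output for bogus_last.txt shows no error and gives no indication that the data contains bad input, which surprised me, to say the least. I tried a suggestion to add an ErrorListener, but that didn't catch the bogus entry. I also tried adding an ErrorStrategy, but that didn't catch it either.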

Below is the code I'm using to implement the ErrorListener and ErrorStrategy. inErrorRecoveryMode doesn't seem to be what I'm after, though I'm not sure I'm printing the right data.

What changes do I need to make to my test bench so that it errors out in a case like bogus_last.txt?

test_hello.py

import sys
from antlr4 import *
from HelloLexer import HelloLexer
from HelloParser import HelloParser
from antlr4.error.ErrorListener import ErrorListener
from antlr4.error.ErrorStrategy import DefaultErrorStrategy

class MyErrorListener( ErrorListener ):

    def __init__(self):
        super().__init__()

    def syntaxError(self, recognizer, offendingSymbol, line, column, msg, e):
        raise Exception("Oh no!!")

    def reportAmbiguity(self, recognizer, dfa, startIndex, stopIndex, exact, ambigAlts, configs):
        raise Exception("Oh no!!")

    def reportAttemptingFullContext(self, recognizer, dfa, startIndex, stopIndex, conflictingAlts, configs):
        raise Exception("Oh no!!")

    def reportContextSensitivity(self, recognizer, dfa, startIndex, stopIndex, prediction, configs):
        raise Exception("Oh no!!")

class MyErrorStrategy(DefaultErrorStrategy):

    def __init__(self):
        super().__init__()

    def reset(self, parser):
        raise Exception("Oh no!!")

    def recoverInline(self, parser):
        raise Exception("Oh no!!")

    def recover(self, parser, excp):
        raise Exception("Oh no!!")

    def sync(self, parser):
        raise Exception("Oh no!!")

    def inErrorRecoveryMode(self, parser):
        ctx = parser._ctx
        print(self.lastErrorIndex)
        return super().inErrorRecoveryMode(parser)

    def reportError(self, parser, excp):
        raise Exception("Oh no!!")


def main(argv):
    input = FileStream(argv[1])
    lexer = HelloLexer(input)
    stream = CommonTokenStream(lexer)
    parser = HelloParser(stream)
    parser.addErrorListener( MyErrorListener() )
    parser._errHandler = MyErrorStrategy()
    tree = parser.r()
    print(tree.toStringTree(recog=parser))

if __name__ == '__main__':
    main(sys.argv)

Answer 1

The input:

hello world
bogus

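produces no error because the parser successfully parses hello world using the production r : 'hello' ID ; and then stops. You didn't tell the parser to consume all the tokens in the token stream. To force it to do so, add the EOF token to the end of the rule: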

r  : 'hello' ID EOF;
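If the grammar cannot be changed, roughly the same condition can be checked from the driver: after the start rule returns, look at the next token the parser has not consumed and see whether it is EOF. The sketch below is only an illustration under stated assumptions. The helper names (find_trailing_token, require_all_consumed, fake_parser) are made up for this example, EOF_TYPE is assumed to match Token.EOF in the Python runtime, and a stand-in object replaces the real parser so the snippet runs without ANTLR installed.

```python
from types import SimpleNamespace

EOF_TYPE = -1  # assumption: same value as antlr4.Token.EOF in the Python runtime


def find_trailing_token(parser):
    """Return the first unconsumed token, or None if only EOF remains."""
    token = parser.getCurrentToken()  # next token the parser has not consumed
    return None if token.type == EOF_TYPE else token


def require_all_consumed(parser):
    """Raise if the start rule returned before the end of the input."""
    token = find_trailing_token(parser)
    if token is not None:
        raise RuntimeError(
            f"line {token.line}:{token.column} unparsed trailing input {token.text!r}"
        )


def fake_parser(token_type, text, line, column):
    """Hypothetical stand-in for a real parser, used only for this demo."""
    token = SimpleNamespace(type=token_type, text=text, line=line, column=column)
    return SimpleNamespace(getCurrentToken=lambda: token)


if __name__ == "__main__":
    # Parser stopped at the end of the input: nothing to report.
    require_all_consumed(fake_parser(EOF_TYPE, "<EOF>", 2, 0))
    # Parser stopped in front of a leftover token (any non-EOF type).
    try:
        require_all_consumed(fake_parser(2, "bogus", 2, 0))
    except RuntimeError as exc:
        print(exc)
```

With a real parser, require_all_consumed(parser) would be called right after tree = parser.r(). Adding EOF to the rule is still the more direct fix, since the error then goes through the normal listener path.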

With EOF added to the rule, the input:

hello world
bogus

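will produce an error. However, that error is only printed to stderr, and the parser tries to recover and continue parsing. To make the parse fail instead, do something like this: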

import antlr4
from antlr4.error.ErrorListener import ErrorListener

from HelloLexer import HelloLexer
from HelloParser import HelloParser


class BailOnErrorListener(ErrorListener):
    def syntaxError(self, recognizer, offending_symbol, line: int, column: int, msg, error):
        raise RuntimeError(f'msg: {msg}')


def main(src):
    lexer = HelloLexer(antlr4.InputStream(src))
    parser = HelloParser(antlr4.CommonTokenStream(lexer))
    parser.removeErrorListeners()
    parser.addErrorListener(BailOnErrorListener())
    tree = parser.r()
    print(tree.toStringTree(recog=parser))


if __name__ == '__main__':
    src = "hello world\nbogus"
    main(src)

The call to parser.r() will then fail:

Traceback (most recent call last):
  ...
RuntimeError: msg: extraneous input 'bogus' expecting <EOF>
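
Since the question asks for every piece of invalid data to be reported, a possible variation is to collect the messages and raise once at the end, rather than bailing out on the first error. This is a minimal sketch, not part of the original answer. CollectingErrorListener and report are hypothetical names, and the ImportError fallback exists only so the snippet runs where the ANTLR runtime is not installed.

```python
try:
    from antlr4.error.ErrorListener import ErrorListener
except ImportError:
    # Fallback so the sketch still runs without the ANTLR runtime installed.
    class ErrorListener:
        pass


class CollectingErrorListener(ErrorListener):
    """Record every syntax error instead of raising on the first one."""

    def __init__(self):
        super().__init__()
        self.errors = []

    def syntaxError(self, recognizer, offendingSymbol, line, column, msg, e):
        # Same "line L:C message" layout as the default console output.
        self.errors.append(f"line {line}:{column} {msg}")


def report(listener):
    """Raise a single exception listing all collected errors, if any."""
    if listener.errors:
        raise RuntimeError("\n".join(listener.errors))


if __name__ == "__main__":
    listener = CollectingErrorListener()
    # With a real parser the listener would be attached via
    # parser.removeErrorListeners() and parser.addErrorListener(listener)
    # before calling parser.r(). Here the callback is invoked by hand,
    # using the message shown in the traceback above.
    listener.syntaxError(None, None, 2, 0,
                         "extraneous input 'bogus' expecting <EOF>", None)
    print(listener.errors)
```

Calling report(listener) after parser.r() then fails with all messages at once. Characters the lexer cannot tokenize are reported through the lexer's own listeners, so the same listener would probably need to be added to the lexer as well.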
