How to resolve parsing error in ANTLR CPP14 Grammar

Question

I am using the ANTLR grammar below to parse my code.

https://github.com/antlr/grammars-v4/tree/master/cpp

But I get parsing errors with the following code:

TEST_F(TestClass, false_positive__N)
{
  static constexpr char text[] =
    R"~~~(; ModuleID = 'a.cpp'
            source_filename = "a.cpp"

   define private i32 @"__ir_hidden#100007_"(i32 %arg1) {
     ret i32 %arg1
   }

define i32 @main(i32 %arg1) {
   %1 = call i32 @"__ir_hidden#100007_"(i32 %arg1)
   ret i32 %1
}
)~~~";

 NameMock ns(text);
 ASSERT_EQ(std::string(text), ns.getSeed());
}

Error details:

line 12:29 token recognition error at: '#1'
line 12:37 token recognition error at: '"(i32 %arg1)\n'
line 12:31 missing ';' at '00007_'
line 13:2 missing ';' at 'ret'
line 13:10 mismatched input '%' expecting {'alignas', '(', '[', '{', '=', ',', ';'}
line 14:0 missing ';' at '}'
line 15:0 mismatched input ')' expecting {'alignas', '(', '[', '{', '=', ',', ';'}
line 15:4 token recognition error at: '";\n'

What modifications are needed in the parser/lexer to parse the input correctly? Any help on this is highly appreciated. Thanks in advance.

Answer 1

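Whenever some input is not parsed properly, I first display all the tokens the input generates. If you do that, you will probably see why things go wrong. Another way is to remove most of the source and gradually add lines of code back: at some point the parser will fail, and you have a starting point for solving it.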

So, if you dump the tokens your input creates, you get these tokens:

Identifier                `TEST_F`
LeftParen                 `(`
Identifier                `TestClass`
Comma                     `,`
Identifier                `false_positive__N`
RightParen                `)`
LeftBrace                 `{`
Static                    `static`
Constexpr                 `constexpr`
Char                      `char`
Identifier                `text`
LeftBracket               `[`
RightBracket              `]`
Assign                    `=`
UserDefinedLiteral        `R"~~~(; ModuleID = 'a.cpp'\n            source_filename = "a.cpp"\n\n   define private i32 @"__ir_hidden#100007_"(i32 %arg1) {\n     ret i32 %arg1\n   }\n\ndefine i32 @main(i32 %arg1) {\n   %1 = call i32 @"__ir_hidden`
Directive                 `#100007_"(i32 %arg1)`
...

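You can see that the input R"~~~( ... )~~~" is not tokenized as a StringLiteral. Note that a StringLiteral will never be created, because at the top of the lexer grammar there is this rule: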

Literal:
    IntegerLiteral
    | CharacterLiteral
    | FloatingLiteral
    | StringLiteral
    | BooleanLiteral
    | PointerLiteral
    | UserDefinedLiteral;

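which causes none of IntegerLiteral..UserDefinedLiteral to be created: all of them become Literal tokens. It is better to move this Literal rule to the parser instead. I must admit that, while scrolling through the lexer grammar, it looks like a bit of a mess, and fixing R"~~~( ... )~~~" will only delay another lingering problem popping up :). I'm pretty sure this grammar has never been properly tested and is full of bugs.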
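To make the shadowing concrete, here is a small Java sketch of the tie-breaking an ANTLR lexer applies: the longest match wins, and when two rules match the same text the rule defined first wins. This is a toy model built on java.util.regex, not ANTLR itself; the class RulePriorityDemo, its methods, and the simplified patterns are all made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy model (hypothetical, not ANTLR): rules are tried in definition order,
// the longest match wins, and the earliest rule wins a tie.
final class RulePriorityDemo {

    static String winningRule(Map<String, Pattern> rulesInOrder, String input) {
        String winner = null;
        int longest = 0;
        for (Map.Entry<String, Pattern> rule : rulesInOrder.entrySet()) {
            Matcher m = rule.getValue().matcher(input);
            // Strict '>' means a later rule with an equally long match
            // never replaces an earlier one.
            if (m.lookingAt() && m.end() > longest) {
                longest = m.end();
                winner = rule.getKey();
            }
        }
        return winner;
    }

    // Simplified stand-ins for the real lexer rules (assumption).
    static Map<String, Pattern> rules(boolean literalFirst) {
        String integer = "[0-9]+";
        String string = "\"[^\"]*\"";
        Map<String, Pattern> rules = new LinkedHashMap<>();
        if (literalFirst) {
            rules.put("Literal", Pattern.compile(integer + "|" + string));
        }
        rules.put("IntegerLiteral", Pattern.compile(integer));
        rules.put("StringLiteral", Pattern.compile(string));
        return rules;
    }

    public static void main(String[] args) {
        System.out.println(winningRule(rules(true), "\"abc\";"));
        System.out.println(winningRule(rules(false), "\"abc\";"));
    }
}
```

With the catch-all Literal rule listed first, the sketch reports Literal for the quoted string; without it, StringLiteral wins. That mirrors why moving Literal into the parser lets the specific token types surface.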

If you look at the lexer definition of StringLiteral:

StringLiteral
 : Encodingprefix? '"' Schar* '"'
 | Encodingprefix? 'R' Rawstring
 ;

fragment Rawstring
 : '"' .*? '(' .*? ')' .*? '"'
 ;

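it is clear why '"' .*? '(' .*? ')' .*? '"' won't match your entire string literal: each non-greedy .*? stops as early as it can, so the rule ends at the first '"' that follows the first ')'. In your input that quote sits in the middle of the literal, which is exactly where the token in the dump above is cut off.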

What you need is a rule looking like this:

StringLiteral
 : Encodingprefix? '"' Schar* '"'
 | Encodingprefix? 'R"' ~[(]* '(' ( . )* ')' ~["]* '"'
 ;

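But that would cause ( . )* to consume too much: it would grab every character and then backtrack to the last quote in the character stream (not what you want).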
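Both failure modes can be reproduced outside ANTLR. The Java sketch below uses java.util.regex as a rough analogue of the two lexer rules: a reluctant pattern for the original Rawstring fragment and a greedy one for the ( . )* variant. The regex engine is only a stand-in for the lexer, and the class RawStringRegexDemo and the sample input are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Regex analogy (hypothetical demo, not generated by ANTLR).
final class RawStringRegexDemo {

    // Analogue of: '"' .*? '(' .*? ')' .*? '"'
    static final Pattern RELUCTANT =
            Pattern.compile("\"(.*?)\\((.*?)\\)(.*?)\"", Pattern.DOTALL);

    // Analogue of: 'R"' ~[(]* '(' ( . )* ')' ~["]* '"'
    static final Pattern GREEDY =
            Pattern.compile("R\"[^(]*\\((.*)\\)[^\"]*\"", Pattern.DOTALL);

    // Returns the first match of the pattern in the input, or null if there is none.
    static String firstMatch(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The intended literal is R"x(f() "q" )x"; more code follows it.
        String source = "R\"x(f() \"q\" )x\"; g(\")\");";
        System.out.println(firstMatch(RELUCTANT, source));
        System.out.println(firstMatch(GREEDY, source));
    }
}
```

For this sample the reluctant pattern stops too early, at `"x(f() "`, while the greedy one runs past the literal and returns `R"x(f() "q" )x"; g(")"`. Neither knows that the closing sequence has to repeat the opening delimiter.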

What you really want is this:

StringLiteral
 : Encodingprefix? '"' Schar* '"'
 | Encodingprefix? 'R"' ~[(]* '(' ( /* break out of this loop when we see `)~~~"` */ . )* ')' ~["]* '"'
 ;

The `break out of this loop when we see )~~~"` part can be done with a semantic predicate like this:

lexer grammar CPP14Lexer;

@members {
  private boolean closeDelimiterAhead(String matched) {
    // Grab everything between the matched text's first quote and first '('. Prepend a ')' and append a quote
    String delimiter = ")" + matched.substring(matched.indexOf('"') + 1, matched.indexOf('(')) + "\"";
    StringBuilder ahead = new StringBuilder();

    // Collect as many characters ahead as there are `delimiter`-chars
    for (int n = 1; n <= delimiter.length(); n++) {
      if (_input.LA(n) == CPP14Lexer.EOF) {
        throw new RuntimeException("Missing delimiter: " + delimiter);
      }
      ahead.append((char) _input.LA(n));
    }

    return delimiter.equals(ahead.toString());
  }
}

...

StringLiteral
 : Encodingprefix? '"' Schar* '"'
 | Encodingprefix? 'R"' ~[(]* '(' ( {!closeDelimiterAhead(getText())}? . )* ')' ~["]* '"'
 ;

...

If you now dump the tokens, you will see this:

Identifier                `TEST_F`
LeftParen                 `(`
Identifier                `TestClass`
Comma                     `,`
Identifier                `false_positive__N`
RightParen                `)`
LeftBrace                 `{`
Static                    `static`
Constexpr                 `constexpr`
Char                      `char`
Identifier                `text`
LeftBracket               `[`
RightBracket              `]`
Assign                    `=`
Literal                   `R"~~~(; ModuleID = 'a.cpp'\n            source_filename = "a.cpp"\n\n   define private i32 @"__ir_hidden#100007_"(i32 %arg1) {\n     ret i32 %arg1\n   }\n\ndefine i32 @main(i32 %arg1) {\n   %1 = call i32 @"__ir_hidden#100007_"(i32 %arg1)\n   ret i32 %1\n}\n)~~~"`
Semi                      `;`
...

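That is: R"~~~( ... )~~~" is properly tokenized as a single token (albeit a Literal token instead of a StringLiteral...). It will throw an exception when the input is like R"~~~( ... )~~" or R"~~~( ... )~~~~", and it will successfully tokenize input like R"~~~( )~~" )~~~~" )~~~".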
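These cases can be checked without generating the lexer. The sketch below applies the same idea as closeDelimiterAhead to a plain String: derive the closing sequence from the opening delimiter, then consume characters until that sequence is directly ahead. The class RawStringScanner and its scan method are hypothetical helpers written for this illustration, not part of the grammar.

```java
// Stand-alone sketch (hypothetical helper, not part of the grammar): scans one
// raw string literal using the same "is the closing delimiter ahead?" check.
final class RawStringScanner {

    static String scan(String input, int start) {
        if (!input.startsWith("R\"", start)) {
            throw new IllegalArgumentException("No raw string literal at offset " + start);
        }
        int open = input.indexOf('(', start + 2);
        if (open < 0) {
            throw new IllegalArgumentException("Missing '(' after the opening delimiter");
        }
        // Everything between the opening quote and the first '(' is the delimiter.
        String closing = ")" + input.substring(start + 2, open) + "\"";
        int pos = open + 1;
        // Consume characters until the closing sequence is directly ahead.
        while (!input.startsWith(closing, pos)) {
            if (pos >= input.length()) {
                throw new IllegalStateException("Missing delimiter: " + closing);
            }
            pos++;
        }
        return input.substring(start, pos + closing.length());
    }

    public static void main(String[] args) {
        // Near-miss closers inside the body are skipped.
        System.out.println(scan("R\"~~~( )~~\" )~~~~\" )~~~\"; int x;", 0));
        try {
            scan("R\"~~~( ... )~~\"", 0); // closer is one '~' short
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The first call returns the whole literal up to and including `)~~~"`; the second reports the missing delimiter, matching the behaviour described for the predicate-based rule.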

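Quickly looking at the parser grammar, I see tokens like StringLiteral being referenced, but such tokens will never be produced by the lexer (as I mentioned earlier).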

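Proceed with caution with this grammar. I would not advise (blindly) using it for anything other than some sort of educational purpose. Do not use it in production!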

Answer 2

The following changes to the lexer helped me resolve the raw string parsing issue:

 Stringliteral
   : Encodingprefix? '"' Schar* '"'
   | Encodingprefix? '"' Schar* '" GST_TIME_FORMAT'
   | Encodingprefix? 'R' Rawstring
 ;

fragment Rawstring
 : '"'              // Match Opening Double Quote
   ( /* Handle Empty D_CHAR_SEQ without Predicates
        This should also work
        '(' .*? ')'
      */
     '(' ( ~')' | ')'+ ~'"' )* (')'+)

   | D_CHAR_SEQ
         /*  // Limit D_CHAR_SEQ to 16 characters
            { ( ( getText().length() - ( getText().indexOf("\"") + 1 ) ) <= 16 ) }?
         */
     '('
     /* From Spec :
        Any member of the source character set, except
        a right parenthesis ) followed by the initial D_CHAR_SEQUENCE
        ( which may be empty ) followed by a double quote ".

      - The following loop consumes characters until it matches the
        terminating sequence of characters for the RAW STRING
      - The options are mutually exclusive, so Only one will
        ever execute in each loop pass
      - Each Option will execute at least once.  The first option needs to
        match the ')' character even if the D_CHAR_SEQ is empty. The second
        option needs to match the closing \" to fall out of the loop. Each
        option will only consume at most 1 character
      */
     (   //  Consume everything but the Double Quote
       ~'"'
     |   //  If text Does Not End with closing Delimiter, consume the Double Quote
       '"'
       {
            !getText().endsWith(
                 ")"
               + getText().substring( getText().indexOf( "\"" ) + 1
                                    , getText().indexOf( "(" )
                                    )
               + '\"'
             )
       }?
     )*
   )
   '"'              // Match Closing Double Quote

   /*
   // Strip Away R"D_CHAR_SEQ(...)D_CHAR_SEQ"
   //  Send D_CHAR_SEQ <TAB> ... to Parser
   {
     setText( getText().substring( getText().indexOf("\"") + 1
                                 , getText().indexOf("(")
                                 )
            + "\t"
            + getText().substring( getText().indexOf("(") + 1
                                 , getText().lastIndexOf(")")
                                 )
            );
   }
    */
 ;

 fragment D_CHAR_SEQ     // Should be limited to 16 characters
    : D_CHAR+
 ;
 fragment D_CHAR
      /*  Any member of the basic source character set except
          space, the left parenthesis (, the right parenthesis ),
          the backslash \, and the control characters representing
           horizontal tab, vertical tab, form feed, and newline.
      */
    : '\u0021'..'\u0023'
    | '\u0025'..'\u0027'
    | '\u002a'..'\u003f'
    | '\u0041'..'\u005b'
    | '\u005d'..'\u005f'
    | '\u0061'..'\u007e'
 ;
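The D_CHAR ranges and the commented-out 16-character limit can be checked in plain Java before touching the grammar. The sketch below copies the character ranges from the D_CHAR fragment above; the class RawDelimiterCheck and its method names are hypothetical, and treating an empty delimiter as valid is an assumption based on the grammar handling the empty case in a separate alternative.

```java
// Hypothetical helper mirroring the D_CHAR fragment and the 16-character limit.
final class RawDelimiterCheck {

    static final int MAX_LENGTH = 16;

    // Same code point ranges as the D_CHAR fragment, written as hex values.
    static boolean isDChar(char c) {
        return (c >= 0x21 && c <= 0x23)
            || (c >= 0x25 && c <= 0x27)
            || (c >= 0x2a && c <= 0x3f)
            || (c >= 0x41 && c <= 0x5b)
            || (c >= 0x5d && c <= 0x5f)
            || (c >= 0x61 && c <= 0x7e);
    }

    // True when every character is a D_CHAR and the length limit is respected.
    static boolean isValidDelimiter(String delimiter) {
        if (delimiter.length() > MAX_LENGTH) {
            return false;
        }
        for (int i = 0; i < delimiter.length(); i++) {
            if (!isDChar(delimiter.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValidDelimiter("~~~"));
        System.out.println(isValidDelimiter("a b"));
    }
}
```

A check like this could back the length predicate that is commented out in the Rawstring fragment, should you decide to enable it.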

c++

grammar

antlr

context-free-grammar