Different token names for non-parameterized and parameterized statements, or how to jump back to the previous token with RuleLexer

Question

How can I get different token names for the following examples?

#someNameAttribute // here #someNameAttribute should be tokenized as IDENTIFIER
#someNameAttribute("2a3a796e-9870-4b88-9f2d-383eb9566613", 10) // here #someNameAttribute should be tokenized as PARAMETERIZED_IDENTIFIER, because it is followed by parentheses

My current grammar (it always assigns IDENTIFIER):

grammar Rule;

ruleExpression
    : identifierExpression EOF | parameterizedIdentifierExpression EOF
    ;

identifierExpression
    : IDENTIFIER
    ;

parameterizedIdentifierExpression
    : PIDENTIFIER LPAREN UUID DELIMETER NUMERIC RPAREN
    ;

DELIMETER           : ',';
LPAREN              : '(';
RPAREN              : ')';
UUID                : '"'[0-9a-fA-F]+'-'[0-9a-fA-F]+'-'[1-5][0-9a-fA-F]+'-'[89abAB][0-9a-fA-F]+'-'[0-9a-fA-F]+'"';
NUMERIC             : [0-9]+ ( '.' [0-9]+ )? ;
IDENTIFIER          : '#' [a-zA-Z$_] [a-zA-Z$_0-9]*;
// PARAMETERIZED_IDENTIFIER         : { behind(LPAREN) }? IDENTIFIER;  // Tried a semantic predicate, but no luck; I may have used it the wrong way
WS                  : [ \r\t\u000C\n]+ -> skip;

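Alternatively, if there is a way to check from Java code whether the token after #someNameAttribute is a parenthesis, I would be glad to hear how. I tried that approach too: RuleLexer.nextToken() lets me inspect the next token, but I cannot jump back to the previous token to continue with the rest of the statement, so I start losing tokens.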

How can I use RuleLexer from Java code to predict which token name should be assigned, or how can I jump back to the previous token?

Answer 1

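Try something like this (Java target only, because the predicate is written in Java). PIDENTIFIER must be listed before IDENTIFIER: both rules match the same text, and ANTLR resolves such ties in favor of the rule defined first.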

grammar Rule;

any           : .*? EOF;

LPAREN        : '(';
RPAREN        : ')';
UUID          : '"'[0-9a-fA-F]+'-'[0-9a-fA-F]+'-'[1-5][0-9a-fA-F]+'-'[89abAB][0-9a-fA-F]+'-'[0-9a-fA-F]+'"';
NUMERIC       : [0-9]+ ( '.' [0-9]+ )? ;
PIDENTIFIER   : IDENTIFIER {_input.LA(1) == '('}?;
IDENTIFIER    : '#' [a-zA-Z$_] [a-zA-Z$_0-9]*;
WS            : [ \r\t\u000C\n]+ -> skip;
OTHER         : . ;
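
To make the idea behind the `_input.LA(1) == '('` predicate concrete, here is a minimal plain-Java sketch that needs no ANTLR runtime. It only mirrors the decision the predicate makes: match the identifier, then peek at the next character without consuming it. The class `PeekSketch` and its method `classify` are hypothetical names used for illustration, and the regular expression is an assumption modeled on the IDENTIFIER rule.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of ANTLR or the generated RuleLexer.
final class PeekSketch {

    // Same shape as the IDENTIFIER lexer rule: '#' [a-zA-Z$_] [a-zA-Z$_0-9]*
    private static final Pattern ID = Pattern.compile("#[a-zA-Z$_][a-zA-Z$_0-9]*");

    // Names each identifier in the input by peeking at the character right after it.
    static List<String> classify(String input) {
        List<String> names = new ArrayList<>();
        Matcher m = ID.matcher(input);
        while (m.find()) {
            // Peek only: the character at m.end() is read but not consumed,
            // so scanning continues normally and nothing is lost.
            boolean parenNext = m.end() < input.length() && input.charAt(m.end()) == '(';
            names.add(parenNext ? "PIDENTIFIER" : "IDENTIFIER");
        }
        return names;
    }

    public static void main(String[] args) {
        String source = "#someNameAttribute\n"
                + "#someNameAttribute(\"2a3a796e-9870-4b88-9f2d-383eb9566613\", 10)";
        System.out.println(classify(source)); // prints [IDENTIFIER, PIDENTIFIER]
    }
}
```

The point is that the lookahead reads ahead without advancing, which is what the attempt with nextToken() in the question could not do.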

If whitespace is allowed between the identifier and the (, do the following instead:

grammar Rule;

@lexer::members {
  boolean spacesAndOpenParenAhead() {
    for (int i = 1; ; i++) {
      char ch = (char)_input.LA(i);
      if (ch == '(') {
        return true;
      }
      else if (ch != ' ' && ch != '\t' && ch != '\r' && ch != '\n') {
        return false;
      }
    }
  }
}

...

PIDENTIFIER         : IDENTIFIER {spacesAndOpenParenAhead()}?;
IDENTIFIER          : '#' [a-zA-Z$_] [a-zA-Z$_0-9]*;
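
The helper can be tried out without generating a lexer. The sketch below reproduces the same loop over a plain String, assuming a hypothetical `la` method that stands in for `_input.LA(i)` and returns -1 past the end of input, as ANTLR does with IntStream.EOF. `LookaheadSketch`, `la`, and `EOF` are illustrative names, not part of the grammar above.

```java
// Hypothetical illustration class; mimics the lexer member over a plain String.
final class LookaheadSketch {

    // Stand-in for IntStream.EOF, which ANTLR's LA(i) returns past the end of input.
    static final int EOF = -1;

    // Plays the role of _input.LA(i): the character i positions ahead of pos
    // (1-based), read without consuming anything.
    static int la(String input, int pos, int i) {
        int index = pos + i - 1;
        return index < input.length() ? input.charAt(index) : EOF;
    }

    // Same logic as spacesAndOpenParenAhead() in the grammar; pos is the index
    // just past the identifier.
    static boolean spacesAndOpenParenAhead(String input, int pos) {
        for (int i = 1; ; i++) {
            int ch = la(input, pos, i);
            if (ch == '(') {
                return true;
            }
            if (ch != ' ' && ch != '\t' && ch != '\r' && ch != '\n') {
                return false; // also reached at EOF, so the loop always terminates
            }
        }
    }

    public static void main(String[] args) {
        String id = "#someNameAttribute";
        int end = id.length();
        System.out.println(spacesAndOpenParenAhead(id + "(\"x\", 10)", end)); // true
        System.out.println(spacesAndOpenParenAhead(id + " \t (10)", end));    // true
        System.out.println(spacesAndOpenParenAhead(id + "\n#other", end));    // false
        System.out.println(spacesAndOpenParenAhead(id, end));                 // false (EOF)
    }
}
```

Note that a newline counts as whitespace here, so an identifier followed by a line break and then ( would also be treated as parameterized.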

When I run the following code against either of my two example grammars:

import org.antlr.v4.runtime.*;

public class Main {

    public static void main(String[] args) throws Exception {

        String source = "#someNameAttribute\n" +
                "#someNameAttribute(\"2a3a796e-9870-4b88-9f2d-383eb9566613\", 10)";

        RuleLexer lexer = new RuleLexer(CharStreams.fromString(source));

        CommonTokenStream stream = new CommonTokenStream(lexer);
        stream.fill();

        for (Token t : stream.getTokens()) {
            System.out.printf("%-20s `%s`%n",
                    RuleLexer.VOCABULARY.getDisplayName(t.getType()),
                    t.getText().replace("\n", "\\n"));
        }
    }
}

the following is printed to my console:

IDENTIFIER           `#someNameAttribute`
PIDENTIFIER          `#someNameAttribute`
'('                  `(`
UUID                 `"2a3a796e-9870-4b88-9f2d-383eb9566613"`
OTHER                `,`
NUMERIC              `10`
')'                  `)`
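
As for the second half of the question (deciding the token name from Java code): the driver above already buffers every token through CommonTokenStream.fill(), so the list returned by getTokens() can be indexed freely and there is no need to call nextToken() by hand and lose your position. One possible alternative to a lexer predicate is therefore to lex everything as IDENTIFIER and retag afterwards. The sketch below shows only that retagging logic on a hypothetical `Tok` class so that it runs without ANTLR; `RetagSketch`, `Tok`, `retag`, and `types` are illustrative names, and the string type names are assumptions standing in for the generated token type constants.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of ANTLR or the generated RuleLexer.
final class RetagSketch {

    // Hypothetical minimal token: a type name plus the matched text.
    static final class Tok {
        final String type;
        final String text;

        Tok(String type, String text) {
            this.type = type;
            this.text = text;
        }
    }

    // Returns a copy in which every IDENTIFIER directly followed by an LPAREN
    // token is renamed to PIDENTIFIER; all other tokens are kept unchanged.
    static List<Tok> retag(List<Tok> tokens) {
        List<Tok> out = new ArrayList<>(tokens.size());
        for (int i = 0; i < tokens.size(); i++) {
            Tok t = tokens.get(i);
            boolean parenNext = i + 1 < tokens.size()
                    && tokens.get(i + 1).type.equals("LPAREN");
            if (parenNext && t.type.equals("IDENTIFIER")) {
                out.add(new Tok("PIDENTIFIER", t.text));
            } else {
                out.add(t);
            }
        }
        return out;
    }

    // Convenience for printing and testing: the type names in order.
    static List<String> types(List<Tok> tokens) {
        List<String> names = new ArrayList<>();
        for (Tok t : tokens) {
            names.add(t.type);
        }
        return names;
    }

    public static void main(String[] args) {
        List<Tok> tokens = new ArrayList<>();
        tokens.add(new Tok("IDENTIFIER", "#someNameAttribute"));
        tokens.add(new Tok("IDENTIFIER", "#someNameAttribute"));
        tokens.add(new Tok("LPAREN", "("));
        tokens.add(new Tok("NUMERIC", "10"));
        tokens.add(new Tok("RPAREN", ")"));
        System.out.println(types(retag(tokens)));
        // prints [IDENTIFIER, PIDENTIFIER, LPAREN, NUMERIC, RPAREN]
    }
}
```

Because the grammar skips WS, whitespace never appears in the token list, so this variant treats an identifier followed by spaces and then ( as parameterized too.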

java

antlr4