ANTLR4 finding tokens but returns truncated parse tree
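I support an open source project, and my ANTLR-based parsing is returning a truncated ParseTree. I believe I have provided everything needed to reproduce the problem.
Given a parser generated with ANTLR 4.8-1 and configured as follows: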
public static Expressions parse(String mappingExpression) throws ParseException, IOException {
// Expressions can include references to properties within an
// application interface ("state"),
// properties within an event, and various operators and functions.
InputStream targetStream = new ByteArrayInputStream(mappingExpression.getBytes());
CharStream input = CharStreams.fromStream(targetStream,Charset.forName("UTF-8"));
MappingExpressionLexer lexer = new MappingExpressionLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
MappingExpressionParser parser = new MappingExpressionParser(tokens);
ParseTree tree = null;
BufferingErrorListener errorListener = new BufferingErrorListener();
try {
// remove the default error listeners which print to stderr
parser.removeErrorListeners();
lexer.removeErrorListeners();
// replace with error listener that buffer errors and allow us to retrieve them
// later
parser.addErrorListener(errorListener);
lexer.addErrorListener(errorListener);
tree = parser.expr();
and given the following statement to parse:
results.( $y := "test"; $bta := function($x) {( $count($x.billToAccounts) > 1 ? ($contains($join($x.billToAccounts, ','), "super") ? "Super" : "Standard") : ($contains($x.billToAccounts[0], "super") ? "Super" : "Standard") )}; { "users": $filter($, function($v, $i, $a) { $v.status = "PROVISIONED" }) { "firstName": $.profile.firstName, "lastName": $.profile.lastName, "email": $.profile.login, "lastLogin": $.lastLogin, "id" : $.id, "userType": $bta($.profile) } } )
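The returned parse tree contains only the "results" token, even though all tokens were lexed (as shown in the _input.tokens array) and all of them appear to be on channel 0.
I expected the parser to keep building _localCtx, but this statement in MappingExpressionParser: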
_alt = getInterpreter().adaptivePredict(_input,17,_ctx);
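returns 2, so _localCtx is not expanded any further and contains only a single TerminalNodeContext with "results".
I have tried rearranging the various rules and suspect the cause is related to the position of the parens alternative within the expr rule, but I am missing something.
What causes adaptivePredict to return 2 so early?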
/**
* (c) Copyright 2018, 2019 IBM Corporation
* 1 New Orchard Road,
* Armonk, New York, 10504-1722
* United States
* +1 914 499 1900
* support: Nathaniel Mills wnm3@us.ibm.com
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*
*/
/* Antlr grammar defining the mapping expression language */
grammar MappingExpression;
/* The start rule; begin parsing here.
operator precedence is implied by the ordering in this list */
// =======================
// = PARSER RULES
// =======================
expr:
ID # id
| '*' ('.' expr)? # field_values
| DESCEND ('.' expr)? # descendant
| DOLLAR (('.' expr) | (ARR_OPEN expr ARR_CLOSE))? # context_ref
| ROOT ('.' expr)? # root_path
| '(' (expr (';' (expr)?)*)? ')' # parens
| ARR_OPEN exprOrSeqList? ARR_CLOSE # array_constructor
| OBJ_OPEN fieldList? OBJ_CLOSE # object_constructor
| expr ARR_OPEN ARR_CLOSE # to_array
| expr '.' expr # path
| expr ARR_OPEN expr ARR_CLOSE # array
| VAR_ID (emptyValues | exprValues) # function_call
| FUNCTIONID varList '{' exprList? '}' # function_decl
| VAR_ID ASSIGN (expr | (FUNCTIONID varList '{' exprList? '}')) # var_assign
| (FUNCTIONID varList '{' exprList? '}') exprValues # function_exec
| op=(TRUE|FALSE) # boolean
| op='-' expr # unary_op
| expr op=('*'|'/'|'%') expr # muldiv_op
| expr op=('+'|'-') expr # addsub_op
| expr '&' expr # concat_op
| expr 'in' expr # membership
| expr 'and' expr # logand
| expr 'or' expr # logor
| expr op=('<'|'<='|'>'|'>='|'!='|'=') expr # comp_op
| expr '?' expr (':' expr)? # conditional
| expr CHAIN expr # fct_chain
| VAR_ID # var_recall
| NUMBER # number
| STRING # string
| 'null' # null
;
fieldList : STRING ':' expr (',' STRING ':' expr)*;
exprList : expr (',' expr)* ;
varList : '(' (VAR_ID (',' VAR_ID)*)* ')' ;
exprValues : '(' exprList ')' ((',' exprOrSeq)* ')')?;
emptyValues : '(' ')' ;
seq : expr '..' expr ;
exprOrSeq : seq | expr ;
exprOrSeqList : exprOrSeq (',' exprOrSeq)* ;
// =======================
// = LEXER RULES
// =======================
TRUE : 'true';
FALSE : 'false';
STRING
: '\'' (ESC | ~['\\])* '\''
| '"' (ESC | ~["\\])* '"'
;
NULL : 'null';
ARR_OPEN : '[';
ARR_CLOSE : ']';
OBJ_OPEN : '{';
OBJ_CLOSE : '}';
DOLLAR : '$';
ROOT : '$$' ;
DESCEND : '**';
NUMBER
: INT '.' [0-9]+ EXP? // 1.35, 1.35E-9, 0.3
| INT EXP // 1e10 3e4
| INT // 3, 45
;
FUNCTIONID : 'function' ;
WS: [ \t\n]+ -> skip ; // ignore whitespace
COMMENT: '/*' .*? '*/' -> skip; // allow comments
// Assign token names used in above grammar
CHAIN : '~>' ;
ASSIGN : ':=' ;
MUL : '*' ;
DIV : '/' ;
ADD : '+' ;
SUB : '-' ;
REM : '%' ;
EQ : '=' ;
NOT_EQ : '!=' ;
LT : '<' ;
LE : '<=' ;
GT : '>' ;
GE : '>=' ;
CONCAT : '&';
VAR_ID : '$' ID ;
ID
: [a-zA-Z] [a-zA-Z0-9_]*
| BACK_QUOTE ~[`]* BACK_QUOTE;
// =======================
// = LEXER FRAGMENTS
// =======================
fragment ESC : '\\' (["'\\/bfnrt] | UNICODE) ;
fragment UNICODE : ([\u0080-\uFFFF] | 'u' HEX HEX HEX HEX) ;
fragment HEX : [0-9a-fA-F] ;
fragment INT : '0' | [1-9] [0-9]* ; // no leading zeros
fragment EXP : [Ee] [+\-]? INT ; // \- since - means "range" inside [...]
fragment SINGLE_QUOTE : '\'';
fragment DOUBLE_QUOTE : '"';
fragment BACK_QUOTE : '`';
Although tokens are created for the entire sample input, not all of them are consumed by the parser. If you run this:
String mappingExpression = "results.(\n" +
" $y := \"test\"; \n" +
" $bta := function($x) {\n" +
" (\n" +
" $count($x.billToAccounts) > 1 \n" +
" ? ($contains($join($x.billToAccounts, ','), \"super\") ? \"Super\" : \"Standard\")\n" +
" : ($contains($x.billToAccounts[0], \"super\") ? \"Super\" : \"Standard\") \n" +
" )\n" +
" };\n" +
" { \n" +
" \"users\": $filter($, function($v, $i, $a) { \n" +
" $v.status = \"PROVISIONED\" \n" +
" })\n" +
" { \n" +
" \"firstName\": $.profile.firstName, \n" +
" \"lastName\": $.profile.lastName, \n" +
" \"email\": $.profile.login, \n" +
" \"lastLogin\": $.lastLogin, \n" +
" \"id\" : $.id, \n" +
" \"userType\": $bta($.profile) \n" +
" }\n" +
" } \n" +
")";
InputStream targetStream = new ByteArrayInputStream(mappingExpression.getBytes());
MappingExpressionLexer lexer = new MappingExpressionLexer(CharStreams.fromStream(targetStream, StandardCharsets.UTF_8));
MappingExpressionParser parser = new MappingExpressionParser(new CommonTokenStream(lexer));
ParseTree tree = parser.expr();
System.out.println(tree.toStringTree(parser));
the following will be printed:
(expr results)
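This means that expr successfully matched its first alternative, ID, and then stopped. Because the start rule does not end with EOF, the parser is free to stop after any prefix of the input that forms a valid expr, and it does so without reporting an error.
To force the parser to consume all tokens, introduce the following rule: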
expr_to_eof
: expr EOF
;
and change:
ParseTree tree = parser.expr();
to:
ParseTree tree = parser.expr_to_eof();
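When you run the snippet I posted again (with the default error listeners!), you will see some error messages on the console, i.e. the parser did not process the input successfully.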
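To illustrate why a start rule without EOF can "succeed" on just a prefix, here is a minimal sketch in plain Java. It is a hand-written toy, not ANTLR-generated code, and it does not reproduce ANTLR's prediction algorithm. The class name PrefixParseDemo, its methods, and the tiny grammar `expr : ID ('.' ID)*` are all assumptions made up for this illustration.

```java
import java.util.Arrays;
import java.util.List;

class PrefixParseDemo {
    // Toy grammar (hypothetical, far simpler than MappingExpression):
    //   expr        : ID ('.' ID)* ;
    //   expr_to_eof : expr EOF ;
    static boolean isId(String token) {
        return token.matches("[a-zA-Z][a-zA-Z0-9_]*");
    }

    // Matches the longest prefix that forms an expr and returns the number
    // of tokens consumed (0 if the input does not even start with an ID).
    static int parseExpr(List<String> tokens) {
        if (tokens.isEmpty() || !isId(tokens.get(0))) {
            return 0;
        }
        int pos = 1;
        while (pos + 1 < tokens.size()
                && tokens.get(pos).equals(".")
                && isId(tokens.get(pos + 1))) {
            pos += 2;
        }
        return pos;
    }

    // Same as parseExpr, but additionally requires that nothing is left over.
    static void parseExprToEof(List<String> tokens) {
        int consumed = parseExpr(tokens);
        if (consumed == 0 || consumed < tokens.size()) {
            String found = consumed < tokens.size() ? tokens.get(consumed) : "<EOF>";
            throw new IllegalStateException(
                    "cannot continue at token " + consumed + ": " + found);
        }
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("results", ".", "(", "$y", ")");
        // The prefix parse silently stops after "results".
        System.out.println("expr consumed " + parseExpr(tokens)
                + " of " + tokens.size() + " tokens");
        // The EOF-anchored parse turns the leftover input into an error.
        try {
            parseExprToEof(tokens);
        } catch (IllegalStateException e) {
            System.out.println("expr_to_eof failed: " + e.getMessage());
        }
    }
}
```

The point of the sketch is only the contrast between the two entry points: the unanchored one reports success on a prefix, while the anchored one surfaces the problem.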
If I try to parse the input:
results.(
$y := "test";
$bta := function($x) {
(
$count($x.billToAccounts) > 1
? ($contains($join($x.billToAccounts, ','), "super") ? "Super" : "Standard")
: ($contains($x.billToAccounts[0], "super") ? "Super" : "Standard")
)
};
{
"users": $filter($, function($v, $i, $a) {
$v.status = "PROVISIONED"
})
}
)
then the parser has no problem. Inspecting the tree for:
{
"users": $filter($, function($v, $i, $a) {
$v.status = "PROVISIONED"
})
}
I see that it is recognized as OBJ_OPEN fieldList? OBJ_CLOSE, where fieldList is defined as follows:
fieldList : STRING ':' expr (',' STRING ':' expr)*;
i.e. a comma-separated list of key-value pairs. So if you give the parser this:
{
"users": $filter($, function($v, $i, $a) {
$v.status = "PROVISIONED"
})
{
"firstName": $.profile.firstName,
"lastName": $.profile.lastName,
"email": $.profile.login,
"lastLogin": $.lastLogin,
"id" : $.id,
"userType": $bta($.profile)
}
}
it cannot parse it properly, because:
{
"firstName": $.profile.firstName,
"lastName": $.profile.lastName,
"email": $.profile.login,
"lastLogin": $.lastLogin,
"id" : $.id,
"userType": $bta($.profile)
}
is not itself a key-value pair, and there is no comma separating it from the preceding pair.
This would be parsed properly:
{
"users": $filter($, function($v, $i, $a) {
$v.status = "PROVISIONED"
}),
"some-key": {
"firstName": $.profile.firstName,
"lastName": $.profile.lastName,
"email": $.profile.login,
"lastLogin": $.lastLogin,
"id" : $.id,
"userType": $bta($.profile)
}
}
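Alternatively, $filter($, function($v, $i, $a) { $v.status = "PROVISIONED" }) would have to be allowed to be followed directly by { "firstName": ... }, but I don't see anything in your grammar that makes this valid.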
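The fieldList argument can be checked with a small hand-written recognizer. This is a sketch under assumptions, not the generated MappingExpressionParser: the class FieldListDemo is hypothetical, input is a pre-split list of token strings, and every value is reduced to either a nested object or a single opaque token (here `f` stands in for the whole $filter(...) call).

```java
import java.util.Arrays;
import java.util.List;

class FieldListDemo {
    private final List<String> tokens;
    private int pos = 0;

    FieldListDemo(List<String> tokens) {
        this.tokens = tokens;
    }

    private String peek() {
        return pos < tokens.size() ? tokens.get(pos) : "<EOF>";
    }

    private void expect(String token) {
        if (!peek().equals(token)) {
            throw new IllegalStateException(
                    "expected " + token + " but found " + peek() + " at token " + pos);
        }
        pos++;
    }

    // object : '{' fieldList? '}' ;
    private void object() {
        expect("{");
        if (!peek().equals("}")) {
            fieldList();
        }
        expect("}");
    }

    // fieldList : STRING ':' value (',' STRING ':' value)* ;
    private void fieldList() {
        field();
        while (peek().equals(",")) {
            pos++;
            field();
        }
    }

    private void field() {
        String key = peek();
        if (key.length() < 2 || !key.startsWith("\"") || !key.endsWith("\"")) {
            throw new IllegalStateException("expected a string key but found " + key);
        }
        pos++;
        expect(":");
        value();
    }

    // value : object | ATOM ;   (ATOM = any single non-punctuation token)
    private void value() {
        if (peek().equals("{")) {
            object();
            return;
        }
        if (Arrays.asList("}", ",", ":", "<EOF>").contains(peek())) {
            throw new IllegalStateException("expected a value but found " + peek());
        }
        pos++;
    }

    // True only if the whole token list is exactly one object.
    static boolean accepts(String... toks) {
        FieldListDemo parser = new FieldListDemo(Arrays.asList(toks));
        try {
            parser.object();
            return parser.pos == parser.tokens.size();
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Value followed directly by another object: no comma, no key.
        System.out.println(accepts("{", "\"users\"", ":", "f",
                "{", "\"firstName\"", ":", "x", "}", "}"));
        // Same content, but the second object is a proper key-value pair.
        System.out.println(accepts("{", "\"users\"", ":", "f", ",",
                "\"some-key\"", ":", "{", "\"firstName\"", ":", "x", "}", "}"));
    }
}
```

In the first call the recognizer reaches the inner `{` where only `,` or `}` may follow a value, which is the same situation the real grammar is in after the $filter(...) call.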