Unable to write a grammar in perl6 for parsing lines with special characters
I have this code: https://gist.github.com/ravbell/d94b37f1a346a1f73b5a827d9eaf7c92
use v6;
#use Grammar::Tracer;
grammar invoice {
token ws { \h*};
token super-word {\S+};
token super-phrase { <super-word> [\h <super-word>]*}
token line {^^ \h* [ <super-word> \h+]* <super-word>* \n};
token invoice-prelude-start {^^'Invoice Summary'\n}
token invoice-prelude-end {<line> <?before 'Start Invoice Details'\n>};
rule invoice-prelude {
<invoice-prelude-start>
<line>*?
<invoice-prelude-end>
<line>
}
}
multi sub MAIN(){
my $t = q :to/EOQ/;
Invoice Summary
asd fasdf
asdfasdf
asd 123-fasdf 34.00
qwe {rq} [we-r_q] we
Start Invoice Details
EOQ
say $t;
say invoice.parse($t,:rule<invoice-prelude>);
}
multi sub MAIN('test'){
use Test;
ok invoice.parse('Invoice Summary' ~ "\n", rule => <invoice-prelude-start>);
ok invoice.parse('asdfa {sf} asd-[fasdf] #werwerw'~"\n", rule => <line>);
ok invoice.parse('asdfawerwerw'~"\n", rule => <line>);
ok invoice.subparse('fasdff;kjaf asdf asderwret'~"\n"~'Start Invoice Details'~"\n",rule => <invoice-prelude-end>);
ok invoice.parse('fasdff;kjaf asdf asderwret'~"\n"~'Start Invoice Details'~"\n",rule => <invoice-prelude-end>);
done-testing;
}
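I can't figure out why parsing with the rule <invoice-prelude> fails and returns Nil. Note that even .subparse fails.
The tests for the individual tokens pass when running MAIN with the 'test' argument (except, of course, that the .parse on <invoice-prelude> fails because it is not the full string).
What should be changed in the rule <invoice-prelude> so that the whole string $t in MAIN() is parsed correctly?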
Note that there is a hidden space at the end of the last line of the $t string:
my $t = q :to/EOQ/;
Invoice Summary
asd fasdf
asdfasdf
asd 123-fasdf 34.00
qwe {rq} [we-r_q] we
Start Invoice Details␣ <-- Space at the end of the line
EOQ
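This makes the <invoice-prelude-end> token fail, because it contains the lookahead <?before 'Start Invoice Details'\n>. That lookahead does not allow for a possible space at the end of the line (because of the explicit newline \n at the end of the lookahead). As a result, the <invoice-prelude> rule cannot match either.
A quick fix is to remove the space at the end of the Start Invoice Details line.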
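To see the effect in isolation, here is a minimal sketch outside of any grammar. The variable names ($clean, $trailing, $strict, $lenient) are made up for this illustration and do not appear in the original code.

```raku
# Two candidate last lines: one clean, one with a space before the newline.
my $clean    = "Start Invoice Details\n";
my $trailing = "Start Invoice Details \n";

# The lookahead from the question, and a variant that tolerates
# trailing horizontal whitespace.
my $strict  = rx/ <?before 'Start Invoice Details' \n> /;
my $lenient = rx/ <?before 'Start Invoice Details' \h* \n> /;

say so $clean    ~~ $strict;    # True
say so $trailing ~~ $strict;    # False: the space sits where \n is expected
say so $trailing ~~ $lenient;   # True
```

If the trailing whitespace cannot be removed from the input, the lenient form is probably the smaller change.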
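First, a frugal quantifier *? without backtracking may match the empty string every time. You can use regex instead of rule.
Second, there is a space at the end of the line that starts with Start Invoice Details.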
rule invoice-prelude-end {<line> <?before 'Start Invoice Details' \n>};
regex invoice-prelude {
<invoice-prelude-start>
<line>*?
<invoice-prelude-end>
<line>
}
If you want to avoid backtracking, you can use a negative lookahead instead.
token invoice-prelude-end { <line> };
rule invoice-prelude {
<invoice-prelude-start>
[<line> <!before 'Start Invoice Details' \n>]*
<invoice-prelude-end>
<line>
}
The whole example, with some changes for inspiration:
use v6;
#use Grammar::Tracer;
grammar invoice {
token ws { <!ww>\h* }
token super-word { \S+ }
token line { <super-word>* % <.ws> }
token invoice-prelude-start { 'Invoice Summary' }
rule invoice-prelude-midline { <line> <!before \n <invoice-details-start> \n> }
token invoice-prelude-end { <line> }
token invoice-details-start { 'Start Invoice Details' }
rule invoice-prelude {
<invoice-prelude-start> \n
<invoice-prelude-midline> * %% \n
<invoice-prelude-end> \n
<invoice-details-start> \n
}
}
multi sub MAIN(){
my $t = q :to/EOQ/;
Invoice Summary
asd fasdf
asdfasdf
asd 123-fasdf 34.00
qwe {rq} [we-r_q] we
Start Invoice Details
EOQ
say $t;
say invoice.parse($t,:rule<invoice-prelude>);
}
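TLDR: The problem is that the test input line containing Start Invoice Details ends with horizontal whitespace that you are not handling.
There are two ways to handle it (other than changing the input):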
# Explicitly: allow horizontal whitespace with \h* before the \n
token invoice-prelude-end { <line> <?before 'Start Invoice Details' \h* \n>}
# Implicitly: it must be a rule, and there must be a space before the \n
rule invoice-prelude-end { <line><?before 'Start Invoice Details' \n>}
# (uses the fact that you wrote your own <ws> token)
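The implicit variant relies on a rule turning whitespace after an atom into a call to <.ws>. A small sketch of just that difference, using a hypothetical grammar named Demo with the same ws definition as in the question:

```raku
grammar Demo {
    token ws      { \h* }
    # In a token, the space in the pattern is ignored.
    token strict  { 'Details' \n }
    # In a rule, the space after 'Details' becomes <.ws>, i.e. \h* here.
    rule  relaxed { 'Details' \n }
}

my $input = "Details \n";   # note the space before the newline

say so Demo.parse($input, :rule<strict>);    # False
say so Demo.parse($input, :rule<relaxed>);   # True
```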
Here are some other things that I think would be useful.
I would use the "separator" feature % in line and super-phrase:
token super-phrase { <super-word>+ % \h } # single % doesn't capture trailing separator
token line {
^^ \h*
<super-word>* %% \h+ # double %% can capture optional trailing separator
\n
}
These are [almost] exactly the same as what you wrote.
(What you wrote has to fail to match <super-word> twice in <line>, whereas this only has to fail once.)
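As a quick illustration of the difference between % and %%, here is a sketch on plain strings. The regex names $single and $double are only for this example.

```raku
# One or more words separated by a single horizontal whitespace character.
my $single = rx/ ^ [ \S+ ]+ %  \h $ /;   # separator only between items
my $double = rx/ ^ [ \S+ ]+ %% \h $ /;   # also allows one trailing separator

say so "a b c"  ~~ $single;   # True
say so "a b c " ~~ $single;   # False: trailing separator is not allowed
say so "a b c"  ~~ $double;   # True
say so "a b c " ~~ $double;   # True
```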
I would use the surrounding feature ~ in invoice-prelude:
token invoice-prelude {
# zero or more <line>s surrounded by <invoice-prelude-start> and <invoice-prelude-end>
<invoice-prelude-start> ~ <invoice-prelude-end> <line>*?
<line> # I assume this is here for debugging
}
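For readers who have not met ~ before: the closing goal is written first, and the body comes last. A minimal sketch with literal parentheses (not part of the invoice grammar) that compares it with the conventional ordering:

```raku
# '(' ~ ')' BODY matches the same text as '(' BODY ')'.
my $tilde = "(abc)" ~~ / '(' ~ ')' $<inner>=[ \w+ ] /;
my $plain = "(abc)" ~~ / '(' $<inner>=[ \w+ ] ')' /;

say $tilde.Str;          # (abc)
say $tilde<inner>.Str;   # abc
say $plain.Str;          # (abc)
```

This sketch only covers the successful case; what happens when the goal is missing is not shown here.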
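Note that invoice-prelude does not really gain anything from being a rule, because all of the horizontal whitespace is already handled by the rest of the code.
I don't think there is anything special about the last line of the invoice prelude, so remove <line> from invoice-prelude-end.
(The <line>*? in invoice-prelude will capture it instead.)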
token invoice-prelude-end {<?before 'Start Invoice Details' \h* \n>}
The only regexes that would benefit from being a rule are invoice-prelude-start and invoice-prelude-end.
rule invoice-prelude-start {^^ Invoice Summary \n}
# `^^` is needed so the space ^ will match <.ws>
rule invoice-prelude-end {<?before ^^ Start Invoice Details $$>}
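That only works if you are fine with it also matching input such as Invoice     Summary, because inside a rule the space between the two words becomes <.ws>.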
Note that invoice-prelude-start needs to use \n in order to capture the newline, but invoice-prelude-end can use $$ instead because it does not capture the \n.
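The difference between consuming the newline with \n and merely asserting the end of the line with $$ can be checked on a small string. This is a sketch only, and the variable names are invented for it.

```raku
my $text = "Invoice Summary\nnext line\n";

# \n consumes the newline, so it is part of the match.
my $with-newline = $text ~~ / ^^ 'Invoice Summary' \n /;
# $$ is zero-width: it only asserts that the line ends here.
my $with-anchor  = $text ~~ / ^^ 'Invoice Summary' $$ /;

say $with-newline.Str.chars;   # 16
say $with-anchor.Str.chars;    # 15
```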
If you change super-word to something other than \S+, you may also want to change ws to something like \h+ | <.wb> (word boundary).
#! /usr/bin/env perl6
use v6.d;
grammar invoice {
token TOP { # testing
<invoice-prelude>
<line>
}
token ws { \h* | <.wb> };
token super-word { \S+ };
token super-phrase { <super-word>+ % \h }
token line {
^^ \h*
<super-word>* %% \h+
\n
};
rule invoice-prelude-start {^^ Invoice Summary \n}
rule invoice-prelude-end {<?before ^^ Start Invoice Details $$>};
token invoice-prelude {
<invoice-prelude-start> ~ <invoice-prelude-end>
<line>*?
}
}
multi sub MAIN(){
my $t = q :to/EOQ/;
Invoice Summary
asd fasdf
asdfasdf
asd 123-fasdf 34.00
qwe {rq} [we-r_q] we
Start Invoice Details
EOQ
say $t;
say invoice.parse($t);
}