Unable to write a grammar in perl6 for parsing lines with special characters

I have this code: https://gist.github.com/ravbell/d94b37f1a346a1f73b5a827d9eaf7c92

use v6;
#use Grammar::Tracer;


grammar invoice {

    token ws { \h*};
    token super-word {\S+};
    token super-phrase { <super-word> [\h  <super-word>]*}
    token line {^^ \h* [ <super-word> \h+]* <super-word>* \n};

    token invoice-prelude-start {^^'Invoice Summary'\n}
    token invoice-prelude-end {<line> <?before 'Start Invoice Details'\n>};

    rule invoice-prelude {
        <invoice-prelude-start>
        <line>*?
        <invoice-prelude-end>
        <line>
    }
}

multi sub MAIN(){ 

    my $t = q :to/EOQ/; 
    Invoice Summary
    asd fasdf
    asdfasdf
    asd 123-fasdf 34.00
    qwe {rq} [we-r_q] we
    Start Invoice Details 
    EOQ


    say $t;
    say invoice.parse($t,:rule<invoice-prelude>);
}

multi sub MAIN('test'){
    use Test;
    ok invoice.parse('Invoice Summary' ~ "\n", rule => <invoice-prelude-start>);

    ok invoice.parse('asdfa {sf} asd-[fasdf] #werwerw'~"\n", rule => <line>);
    ok invoice.parse('asdfawerwerw'~"\n", rule => <line>);

    ok invoice.subparse('fasdff;kjaf asdf asderwret'~"\n"~'Start Invoice Details'~"\n",rule => <invoice-prelude-end>);
    ok invoice.parse('fasdff;kjaf asdf asderwret'~"\n"~'Start Invoice Details'~"\n",rule => <invoice-prelude-end>);
    done-testing;
}

I can't figure out why parsing with the rule <invoice-prelude> fails and returns Nil. Note that even .subparse fails.

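The tests for the individual tokens pass when running MAIN with the 'test' argument (except, of course, the .parse on <invoice-prelude-end>, which fails because the token does not consume the full string).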

What should be changed in the rule <invoice-prelude> so that the whole string $t in MAIN() is parsed correctly?

Note that there is a hidden space at the end of the last line of the $t string:

my $t = q :to/EOQ/; 
    Invoice Summary
    asd fasdf
    asdfasdf
    asd 123-fasdf 34.00
    qwe {rq} [we-r_q] we
    Start Invoice Details␣   <-- Space at the end of the line
    EOQ

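This makes the <invoice-prelude-end> token fail, because it contains the lookahead <?before 'Start Invoice Details'\n>. That lookahead does not allow for a space at the end of the line (because of the explicit newline \n at the end of the lookahead). As a result, the <invoice-prelude> rule cannot match either.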
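The effect can be reproduced in isolation. The following is a minimal sketch, not part of the original grammar; the variable name $last-line is made up for illustration, and the patterns are plain regexes rather than grammar tokens.

```raku
# Hypothetical one-line input; note the space before the newline.
my $last-line = "Start Invoice Details \n";

# Literal text followed directly by a newline: the trailing space gets in the way.
say so $last-line ~~ / ^ 'Start Invoice Details' \n /;       # False

# Allowing optional horizontal whitespace before the newline lets it match.
say so $last-line ~~ / ^ 'Start Invoice Details' \h* \n /;   # True
```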

A quick fix is to remove the space at the end of the line Start Invoice Details.

First, a frugal quantifier *? without backtracking may match the empty string every time. You can use regex instead of rule.

Second, there is a space at the end of the line that begins with Start Invoice Details.

rule invoice-prelude-end {<line> <?before 'Start Invoice Details' \n>};

regex invoice-prelude {
    <invoice-prelude-start>
    <line>*?
    <invoice-prelude-end>
    <line>
}

If you want to avoid backtracking, you can use a negative lookahead.

token invoice-prelude-end { <line> };

rule invoice-prelude {
    <invoice-prelude-start>
    [<line> <!before 'Start Invoice Details' \n>]*
    <invoice-prelude-end>
    <line>
}

The whole example, with some changes for inspiration:

use v6;
#use Grammar::Tracer;


grammar invoice {
    token ws { <!ww>\h* }
    token super-word { \S+ }
    token line { <super-word>* % <.ws> }

    token invoice-prelude-start   { 'Invoice Summary' }
    rule  invoice-prelude-midline { <line> <!before \n <invoice-details-start> \n> }
    token invoice-prelude-end     { <line> }
    token invoice-details-start   { 'Start Invoice Details' }

    rule invoice-prelude {
        <invoice-prelude-start> \n
        <invoice-prelude-midline> * %% \n
        <invoice-prelude-end> \n
        <invoice-details-start> \n
    }
}

multi sub MAIN(){

    my $t = q :to/EOQ/;
    Invoice Summary
    asd fasdf
    asdfasdf
    asd 123-fasdf 34.00
    qwe {rq} [we-r_q] we
    Start Invoice Details 
    EOQ


    say $t;
    say invoice.parse($t,:rule<invoice-prelude>);
}

TL;DR: The problem is that the test input line containing Start Invoice Details ends with horizontal whitespace that you are not handling.

There are two ways to deal with it (other than changing the input):

# Explicitly:                                                       vvv
token invoice-prelude-end { <line> <?before 'Start Invoice Details' \h* \n>}

# Implicitly:
rule  invoice-prelude-end { <line><?before 'Start Invoice Details' \n>}
# ^ must be a rule                      and there must be a space ^
# (uses the fact that you wrote your own <ws> token)
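To make the "implicit" variant more concrete, here is a small sketch of how a space inside a rule turns into a call to <.ws>, while the same space inside a token is ignored. The grammar Demo and its two regex names are hypothetical and exist only for this illustration; the ws token is the one from the question.

```raku
grammar Demo {                       # hypothetical grammar, for illustration only
    token ws        { \h* }          # same custom whitespace as in the question
    token as-token  { 'Details' \n } # whitespace in the pattern is ignored
    rule  as-rule   { 'Details' \n } # the space after 'Details' becomes <.ws>
}

my $input = "Details \n";            # trailing space before the newline

say so Demo.parse($input, :rule<as-token>);  # False
say so Demo.parse($input, :rule<as-rule>);   # True
```

Because <.ws> here is \h*, the rule tolerates any amount of trailing horizontal whitespace, including none.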

Here are some other things I think would be useful.

I would use the "separator" feature % in line and super-phrase:

token super-phrase { <super-word>+ % \h } # single % doesn't capture trailing separator

token line {
  ^^ \h*
  <super-word>* %% \h+ # double %% can capture optional trailing separator
  \n
}

These are [almost] exactly the same as what you wrote. (What you wrote has to fail to match <super-word> twice in <line>, whereas this only has to fail once.)
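The difference between % and %% can be seen on a plain string. This is only a sketch with made-up inputs, using anonymous regexes instead of the grammar's tokens.

```raku
# % : items separated by \h, no trailing separator allowed
say so "a b c"  ~~ / ^ [\S+]+ %  \h $ /;   # True
say so "a b c " ~~ / ^ [\S+]+ %  \h $ /;   # False

# %% : same, but one optional trailing separator is accepted
say so "a b c " ~~ / ^ [\S+]+ %% \h $ /;   # True
```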


I would use the surround feature ~ in invoice-prelude:

token invoice-prelude {
    # zero or more <line>s surrounded by <invoice-prelude-start> and <invoice-prelude-end>
    <invoice-prelude-start> ~ <invoice-prelude-end> <line>*?

    <line> # I assume this is here for debugging
}

Note that it doesn't really gain anything from being a rule, since all of the horizontal whitespace is already handled by the rest of the code.
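For readers who have not met ~ before, a tiny sketch on an unrelated, made-up input may help: the pattern written after the closing delimiter is what gets matched between the two delimiters.

```raku
# 'open' ~ 'close' inner   is matched as   'open' inner 'close'
say so "(123)" ~~ / '(' ~ ')' \d+ /;   # True
say so "(123)" ~~ / '(' \d+ ')' /;     # True, the longhand equivalent
```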


I don't think there is anything special about the last line of the invoice prelude, so remove <line> from invoice-prelude-end. (The <line>*? in invoice-prelude will capture it instead.)

token invoice-prelude-end {<?before 'Start Invoice Details' \h* \n>}

The only regexes that could benefit from being a rule are invoice-prelude-start and invoice-prelude-end.

rule  invoice-prelude-start {^^ Invoice Summary \n}
# `^^` is needed  so the space ^ will match <.ws>

rule  invoice-prelude-end {<?before ^^ Start Invoice Details $$>}

That only works if you are fine with it also matching something like      Invoice    Summary    ␤.

Note that invoice-prelude-start needs to use \n in order to capture it, but invoice-prelude-end can use $$ instead, since it does not capture the \n.


If you change super-word to something other than \S+, then you may also want to change ws to something like \h+ | <.wb> (word boundary).


#! /usr/bin/env perl6
use v6.d;

grammar invoice {
    token TOP { # testing
         <invoice-prelude>
         <line>
    }

    token ws { \h* | <.wb> };
    token super-word { \S+ };
    token super-phrase { <super-word>+ % \h }
    token line {
        ^^ \h*
        <super-word>* %% \h+
        \n
    };

    rule invoice-prelude-start {^^ Invoice Summary \n}
    rule invoice-prelude-end {<?before ^^ Start Invoice Details $$>};

    token invoice-prelude {
        <invoice-prelude-start> ~ <invoice-prelude-end>
            <line>*?
    }
}

multi sub MAIN(){ 
    my $t = q :to/EOQ/; 
    Invoice Summary
    asd fasdf
    asdfasdf
    asd 123-fasdf 34.00
    qwe {rq} [we-r_q] we
    Start Invoice Details 
    EOQ


    say $t;
    say invoice.parse($t);
}