正则表达式:PCRE 原子组不起作用
REGEX: PCRE atomic group doesn't work
在我的 PCRE 正则表达式中,我使用原子组来减少回溯。
<\/?\s*\b(?>a(?:bbr|cronym|ddress|pplet|r(?:ea|ticle)|side|udio)?|b(?:ase|asefont|d[io]|ig|lockquote|ody|r|utton)?|c(?:anvas|aption|enter|ite|ode|ol(?:group)?)|d(?:ata(?:list)?|[dlt]|el|etails|fn|ialog|i[rv])|em(?:bed)?|f(?:i(?:eldset|g(?:caption|ure))|o(?:nt|oter|rm)|rame(?:set)?)|h(?:[1-6r]|ead(?:er)?|tml)|i(?:frame|mg|nput|ns)?|kbd|l(?:abel|egend|i(?:nk)?)|m(?:a(?:in|p|rk)|et(?:a|er))|n(?:av|o(?:frames|script))|o(?:bject|l|pt(?:group|ion)|utput)|p(?:aram|icture|re|rogress)?|q|r[pt]|ruby|s|s(?:amp|ection|elect|mall|ource|pan|trike|trong|tyle|ub|ummary|up|vg)|t(?:able|body|[dhrt]|emplate|extarea|foot|head|ime|itle|rack)|ul?|v(?:ar|ideo)|wbr)\b
但在示例调试中,我看到在 f
检查结束后,它会更进一步用于其他选项。我试图在 f
检查失败后停止它,因此它不会检查表达式的其余部分。怎么了?
因为这就是原子组的工作原理。思路是:
at the current position, find the first sequence that matches the pattern inside atomic grouping and hold on to it.
(Source: Confusion with Atomic Grouping - how it differs from the Grouping in regular expression of Ruby?)
所以如果在原子组中没有匹配项,它将遍历所有选项。
您可以改用条件句:
</?\s*\b(?(?=a)a(?:bbr|cronym|ddress|pplet|r(?:ea|ticle)|side|udio)?|(?(?=b)b(?:ase|asefont|d[io]|ig|lockquote|ody|r|utton)?|(?(?=c)c(?:anvas|aption|enter|ite|ode|ol(?:group)?)|(?(?=d)d(?:ata(?:list)?|[dlt]|el|etails|fn|ialog|i[rv])|(?(?=e)em(?:bed)?|(?(?=f)f(?:i(?:eldset|g(?:caption|ure))|o(?:nt|oter|rm)|rame(?:set)?)|(?(?=h)h(?:[1-6r]|ead(?:er)?|tml)|(?(?=i)i(?:frame|mg|nput|ns)?|(?(?=k)kbd|(?(?=l)l(?:abel|egend|i(?:nk)?)|(?(?=m)m(?:a(?:in|p|rk)|et(?:a|er))|(?(?=n)n(?:av|o(?:frames|script))|(?(?=o)o(?:bject|l|pt(?:group|ion)|utput)|(?(?=p)p(?:aram|icture|re|rogress)?|(?(?=q)q|(?(?=r)r[pt]|(?(?=r)ruby|(?(?=s)s|(?(?=s)s(?:amp|ection|elect|mall|ource|pan|trike|trong|tyle|ub|ummary|up|vg)|(?(?=t)t(?:able|body|[dhrt]|emplate|extarea|foot|head|ime|itle|rack)|(?(?=u)ul?|(?(?=v)v(?:ar|ideo)|wbr))))))))))))))))))))))\b
我会假设你知道你在这里使用正则表达式在做什么,因为可能有人认为 PCRE 不是在 "tree"-like 中实现这种匹配的最佳方法时尚。不过我不介意。
使用条件的想法不错,但它以条件本身的形式增加了额外的步骤。此外,每个条件只能在两个方向上分支。
PCRE 有一个叫做 "backtracking control verbs" 的功能,它可以让你做你想做的事。他们有不同程度的控制,在这种情况下我建议的是最强的:
<\/?\s*\b(?>a(?:bbr|cronym|ddress|pplet|r(?:ea|ticle)|side|udio)?|b(?:ase|asefont|d[io]|ig|lockquote|ody|r|utton)?|c(?:anvas|aption|enter|ite|ode|ol(?:group)?)|d(?:ata(?:list)?|[dlt]|el|etails|fn|ialog|i[rv])|em(?:bed)?|f(*COMMIT)(?:i(?:eldset|g(?:caption|ure))|o(?:nt|oter|rm)|rame(?:set)?)|h(?:[1-6r]|ead(?:er)?|tml)|i(?:frame|mg|nput|ns)?|kbd|l(?:abel|egend|i(?:nk)?)|m(?:a(?:in|p|rk)|et(?:a|er))|n(?:av|o(?:frames|script))|o(?:bject|l|pt(?:group|ion)|utput)|p(?:aram|icture|re|rogress)?|q|r[pt]|ruby|s|s(?:amp|ection|elect|mall|ource|pan|trike|trong|tyle|ub|ummary|up|vg)|t(?:able|body|[dhrt]|emplate|extarea|foot|head|ime|itle|rack)|ul?|v(?:ar|ideo)|wbr)\b
https://regex101.com/r/p572K8/2
只需在 'f' 分支后添加一个 (*COMMIT)
动词,就可以将在这种情况下查找失败所需的步骤数减少一半。
(*COMMIT)
告诉引擎在那个时候提交匹配。如果找不到匹配项,它甚至不会再次尝试从 </
开始的匹配项。
要完全优化表达式,您必须在发生分支后的每个点添加 (*COMMIT)
。
您可以做的另一件事是尝试重新排序您的备选方案,以便优先考虑那些最常见的备选方案。这可能是您优化过程中需要考虑的其他因素。
在我的 PCRE 正则表达式中,我使用原子组来减少回溯。
<\/?\s*\b(?>a(?:bbr|cronym|ddress|pplet|r(?:ea|ticle)|side|udio)?|b(?:ase|asefont|d[io]|ig|lockquote|ody|r|utton)?|c(?:anvas|aption|enter|ite|ode|ol(?:group)?)|d(?:ata(?:list)?|[dlt]|el|etails|fn|ialog|i[rv])|em(?:bed)?|f(?:i(?:eldset|g(?:caption|ure))|o(?:nt|oter|rm)|rame(?:set)?)|h(?:[1-6r]|ead(?:er)?|tml)|i(?:frame|mg|nput|ns)?|kbd|l(?:abel|egend|i(?:nk)?)|m(?:a(?:in|p|rk)|et(?:a|er))|n(?:av|o(?:frames|script))|o(?:bject|l|pt(?:group|ion)|utput)|p(?:aram|icture|re|rogress)?|q|r[pt]|ruby|s|s(?:amp|ection|elect|mall|ource|pan|trike|trong|tyle|ub|ummary|up|vg)|t(?:able|body|[dhrt]|emplate|extarea|foot|head|ime|itle|rack)|ul?|v(?:ar|ideo)|wbr)\b
但在示例调试中,我看到在 f
检查结束后,它会更进一步用于其他选项。我试图在 f
检查失败后停止它,因此它不会检查表达式的其余部分。怎么了?
因为这就是原子组的工作原理。思路是:
at the current position, find the first sequence that matches the pattern inside atomic grouping and hold on to it. (Source: Confusion with Atomic Grouping - how it differs from the Grouping in regular expression of Ruby?)
所以如果在原子组中没有匹配项,它将遍历所有选项。 您可以改用条件句:
</?\s*\b(?(?=a)a(?:bbr|cronym|ddress|pplet|r(?:ea|ticle)|side|udio)?|(?(?=b)b(?:ase|asefont|d[io]|ig|lockquote|ody|r|utton)?|(?(?=c)c(?:anvas|aption|enter|ite|ode|ol(?:group)?)|(?(?=d)d(?:ata(?:list)?|[dlt]|el|etails|fn|ialog|i[rv])|(?(?=e)em(?:bed)?|(?(?=f)f(?:i(?:eldset|g(?:caption|ure))|o(?:nt|oter|rm)|rame(?:set)?)|(?(?=h)h(?:[1-6r]|ead(?:er)?|tml)|(?(?=i)i(?:frame|mg|nput|ns)?|(?(?=k)kbd|(?(?=l)l(?:abel|egend|i(?:nk)?)|(?(?=m)m(?:a(?:in|p|rk)|et(?:a|er))|(?(?=n)n(?:av|o(?:frames|script))|(?(?=o)o(?:bject|l|pt(?:group|ion)|utput)|(?(?=p)p(?:aram|icture|re|rogress)?|(?(?=q)q|(?(?=r)r[pt]|(?(?=r)ruby|(?(?=s)s|(?(?=s)s(?:amp|ection|elect|mall|ource|pan|trike|trong|tyle|ub|ummary|up|vg)|(?(?=t)t(?:able|body|[dhrt]|emplate|extarea|foot|head|ime|itle|rack)|(?(?=u)ul?|(?(?=v)v(?:ar|ideo)|wbr))))))))))))))))))))))\b
我会假设你知道你在这里使用正则表达式在做什么,因为可能有人认为 PCRE 不是在 "tree"-like 中实现这种匹配的最佳方法时尚。不过我不介意。
使用条件的想法不错,但它以条件本身的形式增加了额外的步骤。此外,每个条件只能在两个方向上分支。
PCRE 有一个叫做 "backtracking control verbs" 的功能,它可以让你做你想做的事。他们有不同程度的控制,在这种情况下我建议的是最强的:
<\/?\s*\b(?>a(?:bbr|cronym|ddress|pplet|r(?:ea|ticle)|side|udio)?|b(?:ase|asefont|d[io]|ig|lockquote|ody|r|utton)?|c(?:anvas|aption|enter|ite|ode|ol(?:group)?)|d(?:ata(?:list)?|[dlt]|el|etails|fn|ialog|i[rv])|em(?:bed)?|f(*COMMIT)(?:i(?:eldset|g(?:caption|ure))|o(?:nt|oter|rm)|rame(?:set)?)|h(?:[1-6r]|ead(?:er)?|tml)|i(?:frame|mg|nput|ns)?|kbd|l(?:abel|egend|i(?:nk)?)|m(?:a(?:in|p|rk)|et(?:a|er))|n(?:av|o(?:frames|script))|o(?:bject|l|pt(?:group|ion)|utput)|p(?:aram|icture|re|rogress)?|q|r[pt]|ruby|s|s(?:amp|ection|elect|mall|ource|pan|trike|trong|tyle|ub|ummary|up|vg)|t(?:able|body|[dhrt]|emplate|extarea|foot|head|ime|itle|rack)|ul?|v(?:ar|ideo)|wbr)\b
https://regex101.com/r/p572K8/2
只需在 'f' 分支后添加一个 (*COMMIT)
动词,就可以将在这种情况下查找失败所需的步骤数减少一半。
(*COMMIT)
告诉引擎在那个时候提交匹配。如果找不到匹配项,它甚至不会再次尝试从 </
开始的匹配项。
要完全优化表达式,您必须在发生分支后的每个点添加 (*COMMIT)
。
您可以做的另一件事是尝试重新排序您的备选方案,以便优先考虑那些最常见的备选方案。这可能是您优化过程中需要考虑的其他因素。