这个递归正则表达式究竟是如何工作的?

How exactly does this recursive regex work?

这是 this question 的跟进。

看看这个模式:

(o(?1)?o)

匹配任意长度为2no序列,n≥1.
It works, see regex101.com(为更好的演示添加了单词边界)。
问题是:为什么?

在下文中,字符串的描述(匹配与否)将只是一个粗体数字或描述长度的粗体术语,如 2n.

分解(添加空格):

( o (?1)? o )
(           ) # Capture group 1
  o       o   # Matches an o each at the start and the end of the group
              # -> the pattern matches from the outside to the inside.
    (?1)?     # Again the regex of group 1, or nothing.
              # -> Again one 'o' at the start and one at the end. Or nothing.

我不明白为什么这不匹配 2n,但是 2n,因为我会将模式描述为 *未定义数量的 o o,相互堆叠。

可视化:

没有递归,2是一个匹配:

oo

一次递归,4是一个匹配:

o  o
 oo

到目前为止,很简单。

两次递归。显然是错误的,因为模式不匹配 6:

o    o
 o  o
  oo

但是为什么呢?看起来很符合这个模式。

我得出的结论是,重复的不仅仅是普通模式,否则 6 必须匹配。

但是根据regular-expressions.info

(?P<name>[abc])(?1)(?P>name) matches three letters like (?P<name>[abc])[abc][abc] does.

[abc])(?1){3} [...] is equivalent to ([abc])[abc]{3}

所以它似乎只是简单地重新匹配正则表达式代码,而没有关于捕获组的先前匹配的信息。

有人可以解释并想象一下为什么这个模式匹配 2n 而没有别的吗?

编辑:

评论中提到:

I doubt that referencing a capture group inside of itself is actually a supported case.

regular-expressions.info does mention the technique:

If you place a call inside the group that it calls, you'll have a recursive capturing group.

你正确理解了递归。单词边界在这里让你感到困惑。模式周围的 \b 要求正则表达式引擎仅在字符串前后没有字符字符的情况下匹配该字符串。

在这里查看递归是如何进行的:

( o      (?1)?         o )  => oo

(?1) 然后替换为 (o(?1)?o):

( o   (?>o(?1)?o)?     o )  => oo or oooo

再说一遍:

(o (?>o(?>o(?1)?o)?o)?  o) => oo, oooo, oooooo

参见regex demo without word boundaries

为什么要在上面的例子中添加(?>...) Each recursion level in PHP recursive regexes is atomic不像Perl,并且前面有一个级别失败,引擎不会返回到下一级别。

添加单词边界时,匹配的第一个 o 和最后一个 o 不能有任何其他单词字符 before/after。那么,ooo won't match 那么

也请参阅 rexegg.com 中的 Recursive Regular Expressions explained step by step and Word Boundary: \b

Why does oooooo not get matched as a whole but as oooo and oo?

同样,每个递归级别都是原子的。 oooooo 是这样匹配的:

  • (o(?1)?o) 匹配第一个 o
  • (?1)? 得到扩展,模式现在是 (o(?>o(?1)?o)?o) 并且它匹配输入
  • 中的第二个 o
  • 一直持续到(o(?>o(?>o(?>o(?>o(?>o(?>o(?1)?o)?o)?o)?o)?o)?o)?o)不再匹配输入,回溯发生,我们进入第6层,
  • 整个第 6 级递归级别也失败了,因为它无法匹配必要的 os
  • 数量
  • 这一直持续到可以匹配必要数量的 os 的级别。

参见 regex debugger:

这或多或少是 Wiktors 回答的后续——即使在删除单词边界之后,我也很难弄清楚为什么 oooooo (6) 与 oooooo,而 ooooooo (7) 匹配为 oooooo.

详细的工作原理如下:

扩展递归模式时,内部递归是原子的。使用我们的模式,我们可以将其展开为

(?>o(?>o(?>o(?>o(?>oo)?o)?o)?o)?o)

(在实际模式中,这个 get 再次展开,但这不会改变解释)

这是字符串的匹配方式 - 首先 oooooo (6)

(?>o(?>o(?>o(?>o(?>oo)?o)?o)?o)?o)
o   |ooooo                          <- first o gets matched by first atomic group
o   o   |oooo                       <- second o accordingly
o   o   o   |ooo                    <- third o accordingly
o   o   o   o   |oo                 <- fourth o accordingly
o   o   o   o   oo|                 <- fifth/sixth o by the innermost atomic group
                     ^              <- there is no more o to match, so backtracking starts - innermost ag is not matched, cursor positioned after 4th character
o   o   o   o   xx   o   |o         <- fifth o matches, fourth ag is successfully matched (thus no backtracking into it)
o   o   o   o   xx   o   o|         <- sixth o matches, third ag is successfully matched (thus no backtracking into it)
                           ^        <- no more o, backtracking again - third ag can't be backtracked in, so backtracking into second ag (with matching 3rd 0 times)
o   o                      |oo<oo   <- third and fourth o close second and first atomic group -> match returned  (4 os)

现在 ooooooo (7)

(?>o(?>o(?>o(?>o(?>oo)?o)?o)?o)?o)    
o   |oooooo                         <- first o gets matched by first atomic group
o   o   |ooooo                      <- second o accordingly
o   o   o   |oooo                   <- third o accordingly
o   o   o   o   |ooo                <- fourth o accordingly
o   o   o   o   oo|o                <- fifth/sixth o by the innermost atomic group
o   o   o   o   oo  o|              <- fourth ag is matched successfully (thus no backtracking into it)
                         ^          <- no more o, so backtracking starts here, no backtracking into fourth ag, try again 3rd
o   o   o                |ooo<o     <- 3rd ag can be closed, as well as second and first -> match returned (6 os)