Why is the regular expression stuck?

Question

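I have been incrementally updating a regular expression to clean up some data, and found that it runs seemingly forever on one string. Here is the Python code to reproduce it.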

import re

re.sub(r'(([-.]?\d+)|(\d+))+$', '', 'bcn16-01081-210300-16-20160829-ca')

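I suspect this is because \d+ is repeated inside the group, since it works if the regex is simplified to ([-.]\d+|\d+)+$. But I would like a more precise explanation. Does anyone know?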

Answer 1

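In (([-.]?\d+)|(\d+))+$, both [-.]?\d+ and \d+ can match the same strings, because the leading [-.] is optional. Once the outer + is applied to the group, the same run of digits can be split up in many different ways, and the expression starts to behave like the notorious (a+)*$ pattern: when the final $ fails, the engine backtracks through every possible split.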
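To make the blow-up concrete, here is a rough sketch that counts how many ways a backtracking engine could carve up a single run of digits. It is a simplified model, not a trace of what the re module does internally, and count_parses is a hypothetical helper name introduced only for illustration. With one branch per piece (as in (\d+)+) the count doubles with each extra digit; with two interchangeable branches per piece (as in the original pattern) it triples.

```python
from functools import lru_cache


def count_parses(n: int, alternatives: int) -> int:
    """Count the ways to split a run of n digits into non-empty pieces,
    where every piece can be matched by `alternatives` equivalent branches.

    This is a simplified model of the search space (an assumption),
    not a measurement of the regex engine itself.
    """

    @lru_cache(maxsize=None)
    def ways(remaining: int) -> int:
        if remaining == 0:
            return 1  # nothing left: exactly one way to finish
        # Take a first piece of any size, matched by any branch, then recurse.
        return sum(
            alternatives * ways(remaining - size)
            for size in range(1, remaining + 1)
        )

    return ways(n)


for digits in (4, 8, 16):
    print(digits, count_parses(digits, 1), count_parses(digits, 2))
```

For a run of 8 digits such as 20160829 this gives 128 splits with one branch and 4374 with two, and the counts for the separate digit runs in the string multiply together.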

Use

[-.]?\d+(?:[-.]\d+)*$
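As a quick check, the suggested pattern can be dropped into the same re.sub call from the question. This is a minimal sketch; strip_trailing_numbers is a hypothetical wrapper name, and the second and third inputs are made-up variations of the original string.

```python
import re

# The pattern suggested above: every repetition must start with a separator,
# so a run of digits can only be consumed in one way.
TRAILING_NUMBERS = re.compile(r'[-.]?\d+(?:[-.]\d+)*$')


def strip_trailing_numbers(text: str) -> str:
    """Remove a trailing sequence of '-' or '.' separated numbers, if any."""
    return TRAILING_NUMBERS.sub('', text)


# The string from the question ends in "-ca", so nothing matches and the
# call returns promptly with the input unchanged.
print(strip_trailing_numbers('bcn16-01081-210300-16-20160829-ca'))

# Without the "-ca" suffix the whole numeric tail is removed.
print(strip_trailing_numbers('bcn16-01081-210300-16-20160829'))
print(strip_trailing_numbers('abc-12.5'))
```

The pattern accepts the same strings as the original one; it just removes the ambiguity about how the digits are grouped.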

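Because of the nested +, the simplified ([-.]\d+|\d+)+$ is still prone to catastrophic backtracking: its \d+ branch can still split a run of digits in many ways. It happens to finish on this input, but longer runs of digits would slow it down in the same way. Read up on catastrophic backtracking for more detail.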

Explanation

--------------------------------------------------------------------------------
  [-.]?                    any character of: '-', '.' (optional
                           (matching the most amount possible))
--------------------------------------------------------------------------------
  \d+                      digits (0-9) (1 or more times (matching
                           the most amount possible))
--------------------------------------------------------------------------------
  (?:                      group, but do not capture (0 or more times
                           (matching the most amount possible)):
--------------------------------------------------------------------------------
    [-.]                     any character of: '-', '.'
--------------------------------------------------------------------------------
    \d+                      digits (0-9) (1 or more times (matching
                             the most amount possible))
--------------------------------------------------------------------------------
  )*                       end of grouping
--------------------------------------------------------------------------------
  $                        before an optional \n, and the end of the
                           string
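The same breakdown can be kept next to the pattern in code by using re.VERBOSE, which ignores whitespace in the pattern and allows comments. This is only a sketch of one way to write it; the name TRAILING_NUMBERS_VERBOSE is an assumption made for the example.

```python
import re

# Same pattern as above, annotated inline. In verbose mode whitespace
# outside character classes is ignored and '#' starts a comment.
TRAILING_NUMBERS_VERBOSE = re.compile(
    r"""
    [-.]?        # optional leading '-' or '.'
    \d+          # one or more digits
    (?:          # non-capturing group, repeated zero or more times
        [-.]     #   a mandatory '-' or '.' separator
        \d+      #   followed by one or more digits
    )*
    $            # end of string, or just before a trailing newline
    """,
    re.VERBOSE,
)

print(repr(TRAILING_NUMBERS_VERBOSE.sub('', 'bcn16-01081')))
print(repr(TRAILING_NUMBERS_VERBOSE.sub('', 'x-1.2\n')))
```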
