Why does this Regexp take 99.89% fewer steps using pcre rather than Python?

Question

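I just built this expression in the regex101 editor, but accidentally forgot to switch it to the Python flavor. I'm not familiar with the differences, but I assumed they were fairly small. They are not.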

Perl/pcre needs 99.89% fewer steps than Python (6,565 steps versus 6,377,715).
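As a quick sanity check of that figure, the reduction can be recomputed from the two step counts. This is only a sketch; the variable names are illustrative.

```python
# Step counts as reported by regex101 for the same pattern and input.
python_steps = 6_377_715
pcre_steps = 6_565

# Relative reduction in steps when switching from the Python flavor to PCRE.
reduction = (python_steps - pcre_steps) / python_steps * 100

# How many times more steps the Python flavor needs.
ratio = python_steps / pcre_steps

print(f"reduction: {reduction:.3f}%")
print(f"ratio: {ratio:.1f}x")
```

The reduction comes out just under 99.9%, consistent with the 99.89% quoted above.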

The regex:

^(\d{1,3}) +((?:[a-zA-Z0-9\(\)\-≠,]+ )+) +£ *((?:[\d]  {1,4}|\d)+)∑([ \d]+)?

Any help would be much appreciated! Thanks.

Edit

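The data source is multi-line text extracted from a PDF, so the output is less than perfect (you can see the base source PDF here).

I'm trying to extract the box number, the title, and any number that is present (filled in) on specific lines. If you check the link above, you can see the full example. For example: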

Below is a Regex101 screenshot showing positive matches. The topmost matching line shows the box number (155), the title (Trading profits), and the number (5561).

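Constraints:

Ideally, extract the values as you see them in the PCRE compiler, with little or no extra whitespace before or after the match: just the box number, title, and value.
Only match when a number/value is filled in (e.g. 5561 in the example above, so do not match the line immediately after it, box 160, but do match box 165).
Further down the form the format changes, and I have a separate regex for that, so please ignore it.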

Answer 1

Suggestion: use the newer regex module, which supports atomic groups and possessive quantifiers. This cuts the steps needed by about 50% compared to your initial PCRE expression (see a demo on regex101.com):

^
(\d{1,3})\s++
((?>[^£\n]+))£\s++
([ \d]+)(?>[^∑\n]+)∑\s++
([ \d]+)
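The effect of a possessive quantifier or an atomic group can be seen on a toy input. The following is a minimal sketch, not part of the original answer: the `search` helper and the `engine` alias are hypothetical names, and the fallback to the standard-library `re` assumes Python 3.11 or later, where `++` and `(?>...)` are accepted. On older interpreters without the regex module, the helper reports the syntax as unsupported instead of failing.

```python
import re

try:
    import regex as engine  # third-party module suggested in this answer
except ImportError:
    engine = re  # standard library; ++ and (?>...) need Python 3.11+


def search(pattern, text):
    """Return the matched text, None for no match, or 'unsupported'
    if the engine rejects the pattern syntax."""
    try:
        m = engine.search(pattern, text)
    except Exception:
        return "unsupported"
    return m.group(0) if m else None


# Ordinary greedy quantifier: \d+ first takes all three digits, then gives
# one back so that the trailing \d can match.
greedy = search(r"\d+\d", "123")

# Possessive quantifier: \d++ keeps every digit it took, so the trailing \d
# has nothing left and the overall match fails without retrying.
possessive = search(r"\d++\d", "123")

# Atomic group: same behaviour, written as a group.
atomic = search(r"(?>\d+)\d", "123")

print(greedy, possessive, atomic)
```

Because the engine never revisits what a possessive quantifier or atomic group has consumed, a failing line is rejected after far fewer attempts, which is where the step savings come from.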

To get it working, you could do:

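import regex as re  # third-party "regex" module (pip install regex), not the standard-library re

# Verbose (re.X), multi-line (re.M) pattern:
# box number, title up to £, digits after £, digits after ∑
rx = re.compile(r'''
    ^
    (\d{1,3})\s++
    ((?>[^£\n]+))£\s++
    ([ \d]+)(?>[^∑\n]+)∑\s++
    ([ \d]+)''', re.M | re.X)

# `data` holds the multi-line text extracted from the PDF
matches = [[group.strip() for group in m.groups()] for m in rx.finditer(data)]
print(matches)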

For the given excerpt, this yields:

[['145', 'Total turnover from trade', '5    2    0  0  0', '0  0'], ['155', 'Trading profits', '5  5  6  1', '0  0'], ['165', 'Net trading profits ≠ box 155 minus box 160', '5    5  6  1', '0  0'], ['235', 'P rofits before other deductions and reliefs ≠ net sum of', '5  5  6  1', '0  0'], ['300', 'Profits before qualifying donations and group relief ≠', '5  5    6  1', '0     0'], ['315', 'Profits chargeable to Corporation Tax ≠', '5  5    6  1', '0     0'], ['475', 'Net Corporation Tax liability ≠ box 440 minus box 470', '1  0  5  6', '5  9'], ['510', 'Tax chargeable ≠ total of boxes 475, 480, 500 and 505', '1  0  5  6', '5  9'], ['525', 'Self-assessment of tax payable ≠ box 510 minus box 515', '1  0  5  6', '5  9'], ['600', 'Tax outstanding ≠', '1  0  5  6', '5  9']]
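The captured values still contain the spaces that the PDF extraction left between digits. A possible clean-up step is sketched below. The `tidy` function and its field names are hypothetical, and the meaning of the fourth group (the digits after ∑) is not stated in the question, so it is kept as a string rather than interpreted.

```python
def tidy(groups):
    """Turn one [box, title, value, trailing] match into a cleaned record."""
    box, title, value, trailing = groups
    return {
        "box": int(box),
        # Collapse any runs of whitespace inside the title.
        "title": " ".join(title.split()),
        # Remove the gaps between digits, e.g. '5  5  6  1' becomes 5561.
        "value": int(value.replace(" ", "")),
        # Assumption: kept as text because its meaning is not given.
        "trailing": trailing.replace(" ", ""),
    }


# One row taken from the output above.
sample = ['155', 'Trading profits', '5  5  6  1', '0  0']
record = tidy(sample)
print(record)
```

Applied to every row, for example with `[tidy(row) for row in matches]`, this gives the box number, title, and value without the extra whitespace the question asks to avoid.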
