Why does this Regexp take 99.89% fewer steps using pcre rather than Python?

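I just built this expression in the regex101 editor, but accidentally forgot to switch it to the Python flavor. I'm not familiar with the differences, but assumed they were fairly minor. They are not.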

The Perl/pcre flavor takes 99.89% fewer steps than the Python flavor (6,565 vs. 6,377,715 steps).
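For reference, the percentage can be checked from the two step counts reported by regex101. This is just a small arithmetic sketch; the variable names are mine.

```python
# Step counts as reported by regex101 for the same pattern and input
python_steps = 6_377_715
pcre_steps = 6_565

# Relative reduction in steps when switching from the Python flavor to PCRE
reduction = (python_steps - pcre_steps) / python_steps * 100
print(f"{reduction:.2f}% fewer steps")
```

The result is a little under 99.9%, which matches the 99.89% figure when truncated to two decimals.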

https://regex101.com/r/PRwtJY/3

The regex:

^(\d{1,3}) +((?:[a-zA-Z0-9\(\)\-≠,]+ )+) +£ *((?:[\d]  {1,4}|\d)+)∑([ \d]+)?

Any help would be greatly appreciated! Thanks.

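Edit

The data source is multi-line text extracted from a PDF, which leads to less-than-perfect output (you can see the base source PDF here).

I'm trying to extract the box number, the title, and any number that is present (filled in) on specific lines. If you check the link above, you can see the full sample. For example:

Below is a Regex101 screenshot showing a positive match. The topmost match shows the box number (155), the title (Trading profits), and the number (5561).

Limitations: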

Suggestion: use the newer regex module, which supports atomic groups and possessive quantifiers. This cuts the steps needed by about 50% compared to your initial PCRE expression (see a demo on regex101.com):

^
(\d{1,3})\s++
((?>[^£\n]+))£\s++
([ \d]+)(?>[^∑\n]+)∑\s++
([ \d]+)
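If installing the regex module is not an option, the same idea can be approximated with the standard library. Python 3.11 and later accept atomic groups and possessive quantifiers in `re` directly; on older versions, a common workaround is to emulate `(?>X)` with a capturing lookahead followed by a backreference. The sketch below is that emulation, not the expression from the demo: the group names and the `sample` line are made up for illustration (the line is modelled on the box 155 row, not copied from the PDF).

```python
import re

# (?=(?P<name>X))(?P=name) acts like the atomic group (?>X): the lookahead
# captures X once and is never re-entered, and the backreference then
# consumes exactly the captured text.
rx = re.compile(r'''
    ^
    (?P<box>\d{1,3})
    (?=(?P<ws1>\s+))(?P=ws1)
    (?=(?P<title>[^£\n]+))(?P=title)£
    (?=(?P<ws2>\s+))(?P=ws2)
    (?P<amount>[ \d]+)
    (?=(?P<skip>[^∑\n]+))(?P=skip)∑
    (?=(?P<ws3>\s+))(?P=ws3)
    (?P<rest>[ \d]+)''', re.M | re.X)

# Hypothetical input line, shaped like the extracted PDF text
sample = "155 Trading profits £ 5  5  6  1 ∑ 0  0"

m = rx.search(sample)
fields = [m.group(name).strip() for name in ("box", "title", "amount", "rest")]
print(fields)
```

The helper groups (`ws1`, `ws2`, `ws3`, `skip`) exist only to carry the lookahead captures, so it is easier to read the results by name than by position.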


To run the atomic-group expression in Python with the regex module, you could do:

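import regex as re  # third-party module: pip install regex

# Possessive quantifiers (\s++) and atomic groups ((?>...)) never give back
# what they have matched, which avoids needless backtracking.
rx = re.compile(r'''
    ^
    (\d{1,3})\s++
    ((?>[^£\n]+))£\s++
    ([ \d]+)(?>[^∑\n]+)∑\s++
    ([ \d]+)''', re.M | re.X)

# `data` is the multi-line text extracted from the PDF
matches = [[group.strip() for group in m.groups()] for m in rx.finditer(data)]
print(matches)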

For the given data, this yields:

[['145', 'Total turnover from trade', '5    2    0  0  0', '0  0'], ['155', 'Trading profits', '5  5  6  1', '0  0'], ['165', 'Net trading profits ≠ box 155 minus box 160', '5    5  6  1', '0  0'], ['235', 'P rofits before other deductions and reliefs ≠ net sum of', '5  5  6  1', '0  0'], ['300', 'Profits before qualifying donations and group relief ≠', '5  5    6  1', '0     0'], ['315', 'Profits chargeable to Corporation Tax ≠', '5  5    6  1', '0     0'], ['475', 'Net Corporation Tax liability ≠ box 440 minus box 470', '1  0  5  6', '5  9'], ['510', 'Tax chargeable ≠ total of boxes 475, 480, 500 and 505', '1  0  5  6', '5  9'], ['525', 'Self-assessment of tax payable ≠ box 510 minus box 515', '1  0  5  6', '5  9'], ['600', 'Tax outstanding ≠', '1  0  5  6', '5  9']]
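The digit groups still contain the stray spaces from the PDF extraction. A possible follow-up step, sketched here with a hypothetical helper `collapse_digits` and one row copied from the output above, is to strip the whitespace so that '5  5  6  1' becomes the 5561 mentioned in the question. What the last group represents is not stated, so it is kept as a plain string under the made-up key `trailing`.

```python
def collapse_digits(spaced: str) -> str:
    """Remove all whitespace from a string of spaced-out digits."""
    return "".join(spaced.split())

# One row taken from the output shown above
row = ['155', 'Trading profits', '5  5  6  1', '0  0']
box, title, amount, trailing = row

record = {
    "box": int(box),
    "title": title,
    "amount": collapse_digits(amount),
    "trailing": collapse_digits(trailing),  # meaning of this group is an assumption-free passthrough
}
print(record)
```

Keeping the cleaned values as strings avoids losing leading zeros in groups such as '0  0'.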