Invalid pattern in look-behind

Question

Why does this regex work in Python but not in Ruby:

/(?<!([0-1\b][0-9]|[2][0-3]))/

I'd be glad to hear an explanation, and also how to work around it in Ruby.

Edit: here is the whole line of code:

re.sub(r'(?<!([0-1\b][0-9]|[2][0-3])):(?!([0-5][0-9])((?i)(am)|(pm)|(a\.m)|(p\.m)|(a\.m\.)|(p\.m\.))?\b)' , ':\n' , s)

Basically, I'm trying to add '\n' after a colon when that colon is not part of a time.

Answer 1

The Ruby regex engine doesn't allow capturing groups in negative look-behinds. If you need grouping, use a non-capturing group, (?:...):

[8] pry(main)> /(?<!(:?[0-1\b][0-9]|[2][0-3]))/
SyntaxError: (eval):2: invalid pattern in look-behind: /(?<!(:?[0-1\b][0-9]|[2][0-3]))/
[8] pry(main)> /(?<!(?:[0-1\b][0-9]|[2][0-3]))/
=> /(?<!(?:[0-1\b][0-9]|[2][0-3]))/

Docs:

 (?<!subexp)        negative look-behind

                     Subexp of look-behind must be fixed-width.
                     But top-level alternatives can be of various lengths.
                     ex. (?<=a|bc) is OK. (?<=aaa(?:b|cd)) is not allowed.

                     In negative look-behind, capturing group isn't allowed,
                     but non-capturing group (?:) is allowed.

Learned from this answer.
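For comparison, here is a minimal Python sketch of the other side of the question. It assumes the lookbehind from the question followed by a literal colon; the sample string and the variable names are made up for illustration. Python's re module compiles the capturing form because both alternatives have the same fixed width, and the non-capturing form that Ruby requires finds the same positions.

```python
import re

# Capturing group inside a negative lookbehind: Ruby rejects this, Python's re accepts it.
capturing = re.compile(r'(?<!([0-1\b][0-9]|[2][0-3])):')
# Non-capturing variant, the form Ruby (Onigmo) accepts.
non_capturing = re.compile(r'(?<!(?:[0-1\b][0-9]|[2][0-3])):')

sample = '10:30 note: x'  # hypothetical input

# Use match positions rather than findall(), which would return group 1 instead.
capturing_hits = [m.start() for m in capturing.finditer(sample)]
non_capturing_hits = [m.start() for m in non_capturing.finditer(sample)]

print(capturing_hits, non_capturing_hits)  # [10] [10]
```

Only the colon after "note" is reported; the one in "10:30" is preceded by two digits and is skipped.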

Answer 2

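According to the Onigmo regex documentation, capturing groups are not supported in negative lookbehinds. Although this limitation is common among regex engines, not all of them treat it as an error, which is why you see a difference between Python's re and the Onigmo regex library.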

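Now, as for your regex, it does not work correctly in either Ruby or Python: inside a character class, \b matches a BACKSPACE (\x08) character in both Python and Ruby regex, not a word boundary. Moreover, when you put a word boundary after an optional non-word character, then if that character is present in the string, a word character must appear immediately to its right. The word boundary must be moved to right after m, before \.?.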
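Both points can be checked quickly in Python. This is a small sketch; the sample strings and variable names are invented for illustration.

```python
import re

# Inside a character class, \b is the BACKSPACE character (\x08), not a word boundary.
backspace_alone = re.fullmatch(r'[\b]', '\x08')
backspace_then_digit = re.fullmatch(r'[0-1\b][0-9]', '\x085')

# A word boundary placed after the trailing dot only holds if a word char follows the dot.
dot_then_space = re.search(r'a\.m\.\b', '10:43 a.m. today')   # no match: '.' is followed by a space
dot_then_letter = re.search(r'a\.m\.\b', '10:43 a.m.today')   # match: '.' is followed by 't'

# Boundary moved to right after "m", before the optional dot.
fixed = re.search(r'a\.?m\b\.?', '10:43 a.m. today')

print(backspace_alone is not None, backspace_then_digit is not None)  # True True
print(dot_then_space, dot_then_letter.group(), fixed.group())         # None a.m. a.m.
```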

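Another flaw in the current approach is that lookbehinds are not the best way to exclude certain contexts in a case like this. For example, you cannot account for a variable amount of whitespace between the time digits and am / pm. It is better to match the contexts you do not want to touch, and to match and capture those you want to modify. So we need two main alternatives here: one that matches am/pm inside time strings, and another that matches them in all other contexts.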

Your pattern also has too many alternatives that can be merged using character classes and ? quantifiers.

Regex demo

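\b((?:[01]?[0-9]|2[0-3]):[0-5][0-9]\s*[pa]\.?m\b\.?):
- \b - a word boundary
- ((?:[01]?[0-9]|2[0-3]):[0-5][0-9]\s*[pa]\.?m\b\.?) - capturing group 1:
  - (?:[01]?[0-9]|2[0-3]) - an optional 0 or 1 followed by any digit, or 2 followed by a digit from 0 to 3
  - :[0-5][0-9] - a : followed by a number from 00 to 59
  - \s* - zero or more whitespace characters
  - [pa]\.?m\b\.? - a or p, an optional dot, m, a word boundary, an optional dot
| - or
\b[ap]\.?m\b\.? - a word boundary, a or p, an optional dot, m, a word boundary, an optional dot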
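The breakdown above can also be written directly into the pattern with Python's re.VERBOSE flag. This is only a sketch for readability; the name AM_PM and the sample text are illustrative, and the pattern itself is unchanged.

```python
import re

AM_PM = re.compile(r'''
    \b(                          # group 1: a complete time string, to be kept
        (?:[01]?[0-9]|2[0-3])    #   hours: optional 0/1 plus a digit, or 20-23
        :[0-5][0-9]              #   a colon and minutes 00-59
        \s*                      #   zero or more whitespace characters
        [pa]\.?m\b\.?            #   am/pm with optional dots
    )
    |                            # or
    \b[ap]\.?m\b\.?              # a standalone am/pm
''', re.VERBOSE | re.IGNORECASE)

sample = 'am pm  P.M.  10:56pm 10:43 a.m.'
matches = [m.group(0) for m in AM_PM.finditer(sample)]
print(matches)  # ['am', 'pm', 'P.M.', '10:56pm', '10:43 a.m.']
```

Whitespace and # comments inside the pattern are ignored in verbose mode, so the regex behaves the same as the one-line version.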

Python fixed solution:

import re
text = 'am pm  P.M.  10:56pm 10:43 a.m.'
rx = r'\b((?:[01]?[0-9]|2[0-3]):[0-5][0-9]\s*[pa]\.?m\b\.?)|\b[ap]\.?m\b\.?'
result = re.sub(rx, lambda x: x.group(1) if x.group(1) else "\n", text, flags=re.I)

Ruby solution:

text = 'am pm  P.M.  10:56pm 10:43 a.m.'
rx = /\b((?:[01]?[0-9]|2[0-3]):[0-5][0-9]\s*[pa]\.?m\b\.?)|\b[ap]\.?m\b\.?/i
result = text.gsub(rx) { $1 || "\n" }  # keep group 1 (a time) if it matched, otherwise insert a newline

Output:

"\n \n  \n  10:56pm 10:43 a.m."

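The same "match what you keep, replace the rest" idea can be turned toward the colon from the question. The following is only a sketch under assumptions: the pattern TIME_OR_COLON, the helper break_after_colons, and the sample string are hypothetical and not part of the answer above, and the am/pm suffix is made optional so that plain 24-hour times are kept too.

```python
import re

TIME_OR_COLON = re.compile(
    # Group 1: a time such as 10:56pm, 9:05 a.m. or 23:15; it is kept unchanged.
    r'\b((?:[01]?[0-9]|2[0-3]):[0-5][0-9](?:\s*[ap]\.?m\b\.?)?)'
    # Any colon that was not consumed as part of a time.
    r'|:',
    re.IGNORECASE,
)

def break_after_colons(s):
    """Append a newline after every colon that is not part of a time (hypothetical helper)."""
    return TIME_OR_COLON.sub(lambda m: m.group(1) if m.group(1) else ':\n', s)

result = break_after_colons('Note: meet at 10:56pm or 9:05 a.m.')
print(repr(result))  # 'Note:\n meet at 10:56pm or 9:05 a.m.'
```

Because the time alternative is tried first at each position, a colon inside a time is consumed together with its digits and never reaches the bare-colon branch.
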
Answer 3

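@mrzasa did find the problem.

But, guessing that your intent is to replace a non-time colon with ':\n', I think it could be done like this. It also trims a little whitespace.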

(?i)(?<!\b[01][0-9])(?<!\b[2][0-3])([^\S\r\n]*:)[^\S\r\n]*(?![0-5][0-9](?:[ap]\.?m\b\.?)?)

PCRE - https://regex101.com/r/7TxbAJ/1 (replacement: \n)

Python - https://regex101.com/r/w0oqdZ/1 (replacement: \n)

Readable version

 (?i)
 (?<!
      \b [01] [0-9] 
 )
 (?<!
      \b [2] [0-3] 
 )
 (                             # (1 start)
      [^\S\r\n]* 
      :
 )                             # (1 end)
 [^\S\r\n]* 
 (?!
      [0-5] [0-9] 
      (?: [ap] \.? m \b \.? )?
 )
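A possible way to run this pattern from Python is sketched below. The replacement string r'\1\n' (group 1 followed by a newline) is an assumption, as are the name NON_TIME_COLON and the sample input; adjust the replacement if you also want the whitespace before the colon removed.

```python
import re

NON_TIME_COLON = re.compile(
    r'(?i)(?<!\b[01][0-9])(?<!\b[2][0-3])([^\S\r\n]*:)[^\S\r\n]*'
    r'(?![0-5][0-9](?:[ap]\.?m\b\.?)?)'
)

sample = 'Todo : call at 10:30pm'  # hypothetical input

# Assumed replacement: keep group 1 (optional spaces plus the colon), then a newline.
# The horizontal whitespace after the colon is matched outside the group and dropped.
result = NON_TIME_COLON.sub(r'\1\n', sample)
print(repr(result))  # 'Todo :\ncall at 10:30pm'
```

The colon in 10:30pm is left alone because it is preceded by two hour digits and followed by minutes.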
