ERE - 将量词添加到具有内部组和反向引用的组

ERE - adding quantifier to group with inner group and back-reference

正在尝试获取连续重复字母出现两次或三次的单词。无法使用 ERE

找到使用量词和捕获组的方法
$ grep --version | head -n1
grep (GNU grep) 2.25

$ # consecutive repeated letters occurring twice
$ grep -m5 -xiE '[a-z]*([a-z])[a-z]*[a-z]*([a-z])[a-z]*' /usr/share/dict/words
Abbott
Annabelle
Annette
Appaloosa
Appleseed

$ # no output for this, why?
$ grep -m5 -xiE '([a-z]*([a-z])[a-z]*){2}' /usr/share/dict/words


尽管

-P 一起工作
$ grep -m5 -xiP '([a-z]*([a-z])[a-z]*){2}' /usr/share/dict/words
Abbott
Annabelle
Annette
Appaloosa
Appleseed

$ grep -m5 -xiP '([a-z]*([a-z])[a-z]*){3}' /usr/share/dict/words
Chattahoochee
McConnell
Mississippi
Mississippian
Mississippians


感谢 Casimir et Hippolyte 提出更简单的输入和正则表达式来测试此行为

$ echo 'aazbb' | grep -E '(([a-z])[a-z]*){2}' || echo 'No match'
aazbb
$ echo 'aazbbycc' | grep -E '(([a-z])[a-z]*){2}([a-z])[a-z]*' || echo 'No match'
aazbbycc
$ echo 'aazbbycc' | grep -P '(([a-z])[a-z]*){3}' || echo 'No match'
aazbbycc

$ # failing case
$ echo 'aazbbycc' | grep -E '(([a-z])[a-z]*){3}' || echo 'No match'
No match

sed 也有同样的行为

$ sed --version | head -n1
sed (GNU sed) 4.2.2

$ echo 'aazbb' | sed -E '/(([a-z])[a-z]*){2}/! s/.*/No match/'
aazbb    
$ echo 'aazbbycc' | sed -E '/(([a-z])[a-z]*){2}([a-z])[a-z]*/! s/.*/No match/'
aazbbycc

$ # failing case
$ echo 'aazbbycc' | sed -E '/(([a-z])[a-z]*){3}/! s/.*/No match/'
No match


相关搜索链接,我查了一些,但没有得到任何接近这个问题的东西

如果新版本的 grepsed 解决了这个问题,请告诉我。此外,如果在非 GNU 实现中出现此问题

我想 -E 不允许 Quantifiers,这就是为什么它只适用于 -P


匹配 2 个或多个连续的重复字母组:

grep -P '(?:([a-z])*([a-z])){1}' /usr/share/dict/words

匹配 3 组或更多组连续的重复字母:

grep -P '(?:([a-z])*([a-z])){2}' /usr/share/dict/words

选项:

-P, --perl-regexp         PATTERN is a Perl regular expression

更新

四处搜索后,我在 windows 盒子上安装了 gnugrep32,然后 运行
一些测试:

我从一个旧的 SO post 中读到这个:

Non-greedy matching is not part of the Extended Regular Expression syntax supported by grep

因此,我们使用 [a-z]{0,20} 作为测试而不是 [a-z]*[a-z]*?,其中 ? 被忽略(wtf?)

下面是使用总体 (){n} 的增量测试,看看它会在它之前走多远 STOPS BACKTRACKING
成帧。


下班时间

(([a-z])[a-z]{0,20}){1}   len = 2    rr
(([a-z])[a-z]{0,20}){2}   len = 4    rrrr
(([a-z])[a-z]{0,20}){3}   len = 25   rrrrrrrrrrrrrrrrrrrrrrrrr
(([a-z])[a-z]{0,20}){4}   len = 47   rrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
(([a-z])[a-z]{0,20}){5}   len = 69   rrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
(([a-z])[a-z]{0,20}){6}   len = 91   rrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr

{3}{6} 增量长度等于 22。

这恰好是捕获帧表达式的全长([a-z])[a-z]{0,20}
当它回溯到之前的帧时。

结论是它会在 2 帧后自动停止 回溯。

这是有道理的,例如,在 20 帧中,它达到 16 帧,发现无法匹配。
它是否应该回到第 1 帧并在那里进行调整,然后重新尝试一遍。

为什么是的。
不过现在消耗了这么多内存,臃肿的猪只好解压了。
使用这个陈旧的实用程序可能需要很长时间。
嘿,最好将其限制为 2 帧。

当然,(([a-z])[a-z]*){3}没有测试用例,因为贪心量词*
如果它们都是 [a-z] 并且永远不会
,则会在第二帧消耗整行 开始第三帧。

$ # no output for this, why?
$ grep -m5 -xiE '([a-z]*([a-z])[a-z]*){2}' /usr/share/dict/words

因为您搜索的是内部有(至少)双字母的双组(两次相同)。像 abbcabbc [(...) = "abbc" 2 次] 而不是 2 个(最终相似的)组,每个组里面都有一个双字母,比如 abbcdeef.

有 2 个返回参考:

$ grep -iE '[a-z]*([a-z]){1,}[a-z]*([a-z]){1,}[a-z]*`

我提交了一个问题 https://debbugs.gnu.org/cgi/bugreport.cgi?bug=26864 并且现在更新手册以反映此类问题。

来自https://www.gnu.org/software/grep/manual/grep.html#Known-Bugs

Back-references can greatly slow down matching, as they can generate exponentially many matching possibilities that can consume both time and memory to explore. Also, the POSIX specification for back-references is at times unclear. Furthermore, many regular expression implementations have back-reference bugs that can cause programs to return incorrect answers or even crash, and fixing these bugs has often been low-priority: for example, as of 2020 the GNU C library bug database contained back-reference bugs 52, 10844, 11053, 24269 and 25322, with little sign of forthcoming fixes. Luckily, back-references are rarely useful and it should be little trouble to avoid them in practical applications.