ERE - 将量词添加到具有内部组和反向引用的组

Question

正在尝试获取连续重复字母出现两次或三次的单词。无法使用 ERE

找到使用量词和捕获组的方法

$ grep --version | head -n1
grep (GNU grep) 2.25

$ # consecutive repeated letters occurring twice
$ grep -m5 -xiE '[a-z]*([a-z])[a-z]*[a-z]*([a-z])[a-z]*' /usr/share/dict/words
Abbott
Annabelle
Annette
Appaloosa
Appleseed

$ # no output for this, why?
$ grep -m5 -xiE '([a-z]*([a-z])[a-z]*){2}' /usr/share/dict/words

尽管

与 -P 一起工作

$ grep -m5 -xiP '([a-z]*([a-z])[a-z]*){2}' /usr/share/dict/words
Abbott
Annabelle
Annette
Appaloosa
Appleseed

$ grep -m5 -xiP '([a-z]*([a-z])[a-z]*){3}' /usr/share/dict/words
Chattahoochee
McConnell
Mississippi
Mississippian
Mississippians

感谢 Casimir et Hippolyte 提出更简单的输入和正则表达式来测试此行为

$ echo 'aazbb' | grep -E '(([a-z])[a-z]*){2}' || echo 'No match'
aazbb
$ echo 'aazbbycc' | grep -E '(([a-z])[a-z]*){2}([a-z])[a-z]*' || echo 'No match'
aazbbycc
$ echo 'aazbbycc' | grep -P '(([a-z])[a-z]*){3}' || echo 'No match'
aazbbycc

$ # failing case
$ echo 'aazbbycc' | grep -E '(([a-z])[a-z]*){3}' || echo 'No match'
No match

sed 也有同样的行为

$ sed --version | head -n1
sed (GNU sed) 4.2.2

$ echo 'aazbb' | sed -E '/(([a-z])[a-z]*){2}/! s/.*/No match/'
aazbb    
$ echo 'aazbbycc' | sed -E '/(([a-z])[a-z]*){2}([a-z])[a-z]*/! s/.*/No match/'
aazbbycc

$ # failing case
$ echo 'aazbbycc' | sed -E '/(([a-z])[a-z]*){3}/! s/.*/No match/'
No match

相关搜索链接，我查了一些，但没有得到任何接近这个问题的东西

如果新版本的 grep 或 sed 解决了这个问题，请告诉我。此外，如果在非 GNU 实现中出现此问题

Answer 1

~~我想 -E 不允许 Quantifiers，这就是为什么它只适用于 -P~~

匹配 2 个或多个连续的重复字母组：

grep -P '(?:([a-z])*([a-z])){1}' /usr/share/dict/words

匹配 3 组或更多组连续的重复字母：

grep -P '(?:([a-z])*([a-z])){2}' /usr/share/dict/words

选项：

-P, --perl-regexp         PATTERN is a Perl regular expression

Answer 2

更新

四处搜索后，我在 windows 盒子上安装了 gnugrep32，然后运行
一些测试：

我从一个旧的 SO post 中读到这个：

Non-greedy matching is not part of the Extended Regular Expression syntax supported by grep

因此，我们使用 [a-z]{0,20} 作为测试而不是 [a-z]* 或 [a-z]*?，其中 ? 被忽略（wtf？）

下面是使用总体 (){n} 的增量测试，看看它会在它之前走多远 STOPS BACKTRACKING
成帧。

下班时间

(([a-z])[a-z]{0,20}){1}   len = 2    rr
(([a-z])[a-z]{0,20}){2}   len = 4    rrrr
(([a-z])[a-z]{0,20}){3}   len = 25   rrrrrrrrrrrrrrrrrrrrrrrrr
(([a-z])[a-z]{0,20}){4}   len = 47   rrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
(([a-z])[a-z]{0,20}){5}   len = 69   rrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
(([a-z])[a-z]{0,20}){6}   len = 91   rrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr

从 {3} 到 {6} 增量长度等于 22。

这恰好是捕获帧表达式的全长([a-z])[a-z]{0,20}
当它不回溯到之前的帧时。

结论是它会在 2 帧后自动停止回溯。

这是有道理的，例如，在 20 帧中，它达到 16 帧，发现无法匹配。
它是否应该回到第 1 帧并在那里进行调整，然后重新尝试一遍。

为什么是的。
不过现在消耗了这么多内存，臃肿的猪只好解压了。
使用这个陈旧的实用程序可能需要很长时间。
嘿，最好将其限制为 2 帧。

当然，(([a-z])[a-z]*){3}没有测试用例，因为贪心量词*
如果它们都是 [a-z] 并且永远不会
，则会在第二帧消耗整行开始第三帧。

Answer 3

$ # no output for this, why?
$ grep -m5 -xiE '([a-z]*([a-z])[a-z]*){2}' /usr/share/dict/words

因为您搜索的是内部有（至少）双字母的双组（两次相同）。像 abbcabbc [(...) = "abbc" 2 次] 而不是 2 个（最终相似的）组，每个组里面都有一个双字母，比如 abbcdeef.

有 2 个返回参考：

$ grep -iE '[a-z]*([a-z]){1,}[a-z]*([a-z]){1,}[a-z]*`

Answer 4

我提交了一个问题 https://debbugs.gnu.org/cgi/bugreport.cgi?bug=26864 并且现在更新手册以反映此类问题。

来自https://www.gnu.org/software/grep/manual/grep.html#Known-Bugs：

Back-references can greatly slow down matching, as they can generate exponentially many matching possibilities that can consume both time and memory to explore. Also, the POSIX specification for back-references is at times unclear. Furthermore, many regular expression implementations have back-reference bugs that can cause programs to return incorrect answers or even crash, and fixing these bugs has often been low-priority: for example, as of 2020 the GNU C library bug database contained back-reference bugs 52, 10844, 11053, 24269 and 25322, with little sign of forthcoming fixes. Luckily, back-references are rarely useful and it should be little trouble to avoid them in practical applications.

ERE - 将量词添加到具有内部组和反向引用的组

ERE - adding quantifier to group with inner group and back-reference

regex

grep

pcre

gnu

sed