ERE - 将量词添加到具有内部组和反向引用的组
ERE - adding quantifier to group with inner group and back-reference
正在尝试获取连续重复字母出现两次或三次的单词。无法使用 ERE
找到使用量词和捕获组的方法
$ grep --version | head -n1
grep (GNU grep) 2.25
$ # consecutive repeated letters occurring twice
$ grep -m5 -xiE '[a-z]*([a-z])[a-z]*[a-z]*([a-z])[a-z]*' /usr/share/dict/words
Abbott
Annabelle
Annette
Appaloosa
Appleseed
$ # no output for this, why?
$ grep -m5 -xiE '([a-z]*([a-z])[a-z]*){2}' /usr/share/dict/words
尽管
与 -P
一起工作
$ grep -m5 -xiP '([a-z]*([a-z])[a-z]*){2}' /usr/share/dict/words
Abbott
Annabelle
Annette
Appaloosa
Appleseed
$ grep -m5 -xiP '([a-z]*([a-z])[a-z]*){3}' /usr/share/dict/words
Chattahoochee
McConnell
Mississippi
Mississippian
Mississippians
感谢 Casimir et Hippolyte 提出更简单的输入和正则表达式来测试此行为
$ echo 'aazbb' | grep -E '(([a-z])[a-z]*){2}' || echo 'No match'
aazbb
$ echo 'aazbbycc' | grep -E '(([a-z])[a-z]*){2}([a-z])[a-z]*' || echo 'No match'
aazbbycc
$ echo 'aazbbycc' | grep -P '(([a-z])[a-z]*){3}' || echo 'No match'
aazbbycc
$ # failing case
$ echo 'aazbbycc' | grep -E '(([a-z])[a-z]*){3}' || echo 'No match'
No match
sed
也有同样的行为
$ sed --version | head -n1
sed (GNU sed) 4.2.2
$ echo 'aazbb' | sed -E '/(([a-z])[a-z]*){2}/! s/.*/No match/'
aazbb
$ echo 'aazbbycc' | sed -E '/(([a-z])[a-z]*){2}([a-z])[a-z]*/! s/.*/No match/'
aazbbycc
$ # failing case
$ echo 'aazbbycc' | sed -E '/(([a-z])[a-z]*){3}/! s/.*/No match/'
No match
相关搜索链接,我查了一些,但没有得到任何接近这个问题的东西
如果新版本的 grep
或 sed
解决了这个问题,请告诉我。此外,如果在非 GNU 实现中出现此问题
我想 -E
不允许 Quantifiers
,这就是为什么它只适用于 -P
匹配 2 个或多个连续的重复字母组:
grep -P '(?:([a-z])*([a-z])){1}' /usr/share/dict/words
匹配 3 组或更多组连续的重复字母:
grep -P '(?:([a-z])*([a-z])){2}' /usr/share/dict/words
选项:
-P, --perl-regexp PATTERN is a Perl regular expression
更新
四处搜索后,我在 windows 盒子上安装了 gnugrep32,然后 运行
一些测试:
我从一个旧的 SO post 中读到这个:
Non-greedy matching is not part of the Extended Regular Expression syntax supported by grep
因此,我们使用 [a-z]{0,20}
作为测试而不是 [a-z]*
或 [a-z]*?
,其中 ?
被忽略(wtf?)
下面是使用总体 (){n}
的增量测试,看看它会在它之前走多远 STOPS BACKTRACKING
成帧。
下班时间
(([a-z])[a-z]{0,20}){1} len = 2 rr
(([a-z])[a-z]{0,20}){2} len = 4 rrrr
(([a-z])[a-z]{0,20}){3} len = 25 rrrrrrrrrrrrrrrrrrrrrrrrr
(([a-z])[a-z]{0,20}){4} len = 47 rrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
(([a-z])[a-z]{0,20}){5} len = 69 rrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
(([a-z])[a-z]{0,20}){6} len = 91 rrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
从 {3}
到 {6}
增量长度等于 22。
这恰好是捕获帧表达式的全长([a-z])[a-z]{0,20}
当它不回溯到之前的帧时。
结论是它会在 2 帧后自动停止 回溯。
这是有道理的,例如,在 20 帧中,它达到 16 帧,发现无法匹配。
它是否应该回到第 1 帧并在那里进行调整,然后重新尝试一遍。
为什么是的。
不过现在消耗了这么多内存,臃肿的猪只好解压了。
使用这个陈旧的实用程序可能需要很长时间。
嘿,最好将其限制为 2 帧。
当然,(([a-z])[a-z]*){3}
没有测试用例,因为贪心量词*
如果它们都是 [a-z] 并且永远不会
,则会在第二帧消耗整行
开始第三帧。
$ # no output for this, why?
$ grep -m5 -xiE '([a-z]*([a-z])[a-z]*){2}' /usr/share/dict/words
因为您搜索的是内部有(至少)双字母的双组(两次相同)。像 abbcabbc
[(...) = "abbc"
2 次] 而不是 2 个(最终相似的)组,每个组里面都有一个双字母,比如 abbcdeef
.
有 2 个返回参考:
$ grep -iE '[a-z]*([a-z]){1,}[a-z]*([a-z]){1,}[a-z]*`
我提交了一个问题 https://debbugs.gnu.org/cgi/bugreport.cgi?bug=26864 并且现在更新手册以反映此类问题。
来自https://www.gnu.org/software/grep/manual/grep.html#Known-Bugs:
Back-references can greatly slow down matching, as they can generate exponentially many matching possibilities that can consume both time and memory to explore. Also, the POSIX specification for back-references is at times unclear. Furthermore, many regular expression implementations have back-reference bugs that can cause programs to return incorrect answers or even crash, and fixing these bugs has often been low-priority: for example, as of 2020 the GNU C library bug database contained back-reference bugs 52, 10844, 11053, 24269 and 25322, with little sign of forthcoming fixes. Luckily, back-references are rarely useful and it should be little trouble to avoid them in practical applications.
正在尝试获取连续重复字母出现两次或三次的单词。无法使用 ERE
找到使用量词和捕获组的方法$ grep --version | head -n1
grep (GNU grep) 2.25
$ # consecutive repeated letters occurring twice
$ grep -m5 -xiE '[a-z]*([a-z])[a-z]*[a-z]*([a-z])[a-z]*' /usr/share/dict/words
Abbott
Annabelle
Annette
Appaloosa
Appleseed
$ # no output for this, why?
$ grep -m5 -xiE '([a-z]*([a-z])[a-z]*){2}' /usr/share/dict/words
尽管
-P
一起工作
$ grep -m5 -xiP '([a-z]*([a-z])[a-z]*){2}' /usr/share/dict/words
Abbott
Annabelle
Annette
Appaloosa
Appleseed
$ grep -m5 -xiP '([a-z]*([a-z])[a-z]*){3}' /usr/share/dict/words
Chattahoochee
McConnell
Mississippi
Mississippian
Mississippians
感谢 Casimir et Hippolyte 提出更简单的输入和正则表达式来测试此行为
$ echo 'aazbb' | grep -E '(([a-z])[a-z]*){2}' || echo 'No match'
aazbb
$ echo 'aazbbycc' | grep -E '(([a-z])[a-z]*){2}([a-z])[a-z]*' || echo 'No match'
aazbbycc
$ echo 'aazbbycc' | grep -P '(([a-z])[a-z]*){3}' || echo 'No match'
aazbbycc
$ # failing case
$ echo 'aazbbycc' | grep -E '(([a-z])[a-z]*){3}' || echo 'No match'
No match
sed
也有同样的行为
$ sed --version | head -n1
sed (GNU sed) 4.2.2
$ echo 'aazbb' | sed -E '/(([a-z])[a-z]*){2}/! s/.*/No match/'
aazbb
$ echo 'aazbbycc' | sed -E '/(([a-z])[a-z]*){2}([a-z])[a-z]*/! s/.*/No match/'
aazbbycc
$ # failing case
$ echo 'aazbbycc' | sed -E '/(([a-z])[a-z]*){3}/! s/.*/No match/'
No match
相关搜索链接,我查了一些,但没有得到任何接近这个问题的东西
如果新版本的 grep
或 sed
解决了这个问题,请告诉我。此外,如果在非 GNU 实现中出现此问题
我想 -E
不允许 Quantifiers
,这就是为什么它只适用于 -P
匹配 2 个或多个连续的重复字母组:
grep -P '(?:([a-z])*([a-z])){1}' /usr/share/dict/words
匹配 3 组或更多组连续的重复字母:
grep -P '(?:([a-z])*([a-z])){2}' /usr/share/dict/words
选项:
-P, --perl-regexp PATTERN is a Perl regular expression
更新
四处搜索后,我在 windows 盒子上安装了 gnugrep32,然后 运行
一些测试:
我从一个旧的 SO post 中读到这个:
Non-greedy matching is not part of the Extended Regular Expression syntax supported by grep
因此,我们使用 [a-z]{0,20}
作为测试而不是 [a-z]*
或 [a-z]*?
,其中 ?
被忽略(wtf?)
下面是使用总体 (){n}
的增量测试,看看它会在它之前走多远 STOPS BACKTRACKING
成帧。
下班时间
(([a-z])[a-z]{0,20}){1} len = 2 rr
(([a-z])[a-z]{0,20}){2} len = 4 rrrr
(([a-z])[a-z]{0,20}){3} len = 25 rrrrrrrrrrrrrrrrrrrrrrrrr
(([a-z])[a-z]{0,20}){4} len = 47 rrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
(([a-z])[a-z]{0,20}){5} len = 69 rrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
(([a-z])[a-z]{0,20}){6} len = 91 rrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
从 {3}
到 {6}
增量长度等于 22。
这恰好是捕获帧表达式的全长([a-z])[a-z]{0,20}
当它不回溯到之前的帧时。
结论是它会在 2 帧后自动停止 回溯。
这是有道理的,例如,在 20 帧中,它达到 16 帧,发现无法匹配。
它是否应该回到第 1 帧并在那里进行调整,然后重新尝试一遍。
为什么是的。
不过现在消耗了这么多内存,臃肿的猪只好解压了。
使用这个陈旧的实用程序可能需要很长时间。
嘿,最好将其限制为 2 帧。
当然,(([a-z])[a-z]*){3}
没有测试用例,因为贪心量词*
如果它们都是 [a-z] 并且永远不会
,则会在第二帧消耗整行
开始第三帧。
$ # no output for this, why?
$ grep -m5 -xiE '([a-z]*([a-z])[a-z]*){2}' /usr/share/dict/words
因为您搜索的是内部有(至少)双字母的双组(两次相同)。像 abbcabbc
[(...) = "abbc"
2 次] 而不是 2 个(最终相似的)组,每个组里面都有一个双字母,比如 abbcdeef
.
有 2 个返回参考:
$ grep -iE '[a-z]*([a-z]){1,}[a-z]*([a-z]){1,}[a-z]*`
我提交了一个问题 https://debbugs.gnu.org/cgi/bugreport.cgi?bug=26864 并且现在更新手册以反映此类问题。
来自https://www.gnu.org/software/grep/manual/grep.html#Known-Bugs:
Back-references can greatly slow down matching, as they can generate exponentially many matching possibilities that can consume both time and memory to explore. Also, the POSIX specification for back-references is at times unclear. Furthermore, many regular expression implementations have back-reference bugs that can cause programs to return incorrect answers or even crash, and fixing these bugs has often been low-priority: for example, as of 2020 the GNU C library bug database contained back-reference bugs 52, 10844, 11053, 24269 and 25322, with little sign of forthcoming fixes. Luckily, back-references are rarely useful and it should be little trouble to avoid them in practical applications.