Same regex but different results in Pandas vs. R

Question

Consider this simple regex, which is meant to extract titles:

(\w[\w-]+){2,}

Running it in Python (Pandas) and in R (stringr) gives completely different results!

In stringr the extraction works as expected: see how 'this-is-a-very-nice-test' is parsed correctly

> library(stringr)
> str_extract_all('stackoverflow.stack.com/read/this-is-a-very-nice-test', 
+                 regex('(\\w[-\\w]+){2,}'))
[[1]]
[1] "stackoverflow"            "stack"                    "read"                     "this-is-a-very-nice-test"

In Pandas, well, the output is a bit puzzling

import pandas as pd

myseries = pd.Series({'text': 'stackoverflow.stack.com/read/this-is-a-very-nice-test'})

myseries.str.extractall(r'(\w[-\w]+){2,}')
Out[51]: 
             0
     match    
text 0      ow
     1      ck
     2      ad
     3      st

What is going wrong here?

Thanks!

Answer 1

After changing the "{2,}" part to "{1,}", this works as expected:

import re
s = 'stackoverflow.stack.com/read/this-is-a-very-nice-test'
# findall returns the contents of the capturing group for each match
out = re.findall(r'(\w[-\w]+){1,}', s)
print(out)

Output:

['stackoverflow', 'stack', 'com', 'read', 'this-is-a-very-nice-test']

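Edit: An explanation from the Python side: {m,n} is a repetition qualifier, where m and n are decimal integers. It means there must be at least m repetitions and at most n.

In your original example, "{2,}", m is 2 and n is left unbounded, so the pattern must repeat at least twice. If you set m=1, as in "{1,}", a single occurrence is accepted. That is equivalent to "+", so you can replace r'(\w[-\w]+){1,}' with r'(\w[-\w]+)+' and still get the same result.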
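A quick check of the equivalence described above, using only the standard `re` module (the variable names are illustrative, not from the original post). It also shows a side effect worth keeping in mind: lowering the minimum to one repetition lets the three-letter token "com" match, which the "{2,}" version skips because two repetitions need at least four characters.

```python
import re

s = 'stackoverflow.stack.com/read/this-is-a-very-nice-test'

# {1,} and + are the same quantifier: "one or more repetitions".
with_braces = re.findall(r'(\w[-\w]+){1,}', s)
with_plus = re.findall(r'(\w[-\w]+)+', s)
print(with_braces == with_plus)  # True
print(with_braces)

# With a single capturing group, re.findall returns the group, not the
# whole match. Under {2,} the group holds only its last repetition.
with_two = re.findall(r'(\w[-\w]+){2,}', s)
print(with_two)  # ['ow', 'ck', 'ad', 'st']
```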

Answer 2

The (\w[-\w]+){2,} regex contains a repeated capturing group:

The repeated capturing group will capture only the last iteration
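To see that behaviour outside Pandas, here is a small sketch with the standard `re` module (the names `pattern` and `pairs` are only for illustration). It prints each whole match next to what group 1 holds once matching has finished.

```python
import re

s = 'stackoverflow.stack.com/read/this-is-a-very-nice-test'
pattern = re.compile(r'(\w[-\w]+){2,}')

# group(0) is the whole match; group(1) is whatever the repeated
# group captured on its final iteration.
pairs = [(m.group(0), m.group(1)) for m in pattern.finditer(s)]
for whole, last in pairs:
    print(f'{whole!r} -> {last!r}')
```

The whole matches are the strings R reports, and the group values are the two-character fragments Pandas reports.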

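See the regex demo: the substrings highlighted there are the values you get in Pandas with .extractall, because this method expects a "regular expression pattern with capturing groups" and returns "a DataFrame with one row for each match, and one column for each group".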

In contrast to Pandas extractall, R stringr::str_extract_all leaves all captured substrings out of its result and only "extracts all matches and returns a list of character vectors".
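One possible way to get the R-style result in Python is to stop capturing inside the repetition. The sketch below is an illustration under that assumption, not part of the original answer: it makes the inner group non-capturing, and optionally wraps the whole repetition in a single capturing group.

```python
import re

s = 'stackoverflow.stack.com/read/this-is-a-very-nice-test'

# (?:...) groups without capturing, so findall returns whole matches.
whole_matches = re.findall(r'(?:\w[-\w]+){2,}', s)
print(whole_matches)

# For APIs that require a capturing group, capture the entire repetition.
wrapped = re.findall(r'((?:\w[-\w]+){2,})', s)
print(wrapped)
```

The second pattern should also work with `Series.str.extractall`, which needs at least one capturing group, while the first can be used with `Series.str.findall`.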
