Regex for finding chains of >=1 words starting with capital letters and connected with "-" or " "

Question

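I want to get all letter-only "chains" of at least 1 word, where each word starts with a capital letter followed by lowercase letters and the words are connected with a space (" ") or "-" (a single "chain" cannot be connected with both "-" and " ").

For example, for the following text: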

For the First Stage, you should press Start and you should follow Step-One and Step-Three. For the Final Stage, you must follow the sequence of One-Two-Five-Seven Steps

my output should be

["For", "First Stage", "Start", "Step-One", "Step-Three", "Final Stage", "One-Two-Five-Seven", "Steps"]

So far I have tried writing 2 different regexes to solve my problem; the first should return the "chains" connected with "-" and the second should return the "chains" connected with " ":

import re
list(set(re.findall('([A-Z][a-z]+-)*[A-Z][a-z]+', mystring) + re.findall('([A-Z][a-z]+ )*[A-Z][a-z]+', mystring)))

However, I guess there is something wrong with them, since neither of them works properly.

Answer 1

You can use

\b[A-Z][a-z]+(?=([-\s]?))(?:\1[A-Z][a-z]+)*\b(?!-[A-Z])

See the regex demo. Details:

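\b - a word boundary
[A-Z][a-z]+ - an uppercase ASCII letter followed by one or more lowercase ASCII letters
(?=([-\s]?)) - a positive lookahead that requires a - or a whitespace char (one or zero occurrences, i.e. optional) immediately to the right of the current location, capturing that char into Group 1
(?:\1[A-Z][a-z]+)* - zero or more repetitions of
    \1 - the same text as captured in Group 1
    [A-Z][a-z]+ - an uppercase ASCII letter followed by one or more lowercase ASCII letters
\b(?!-[A-Z]) - a word boundary that is not followed by - and an uppercase ASCII letter.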
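
The breakdown above can also be written inline with re.VERBOSE, which may make the role of the backreference easier to see. This is only a sketch: the names CHAIN_RE and find_chains are illustrative, and the pattern is the one described above, with \1 in front of each repeated word.

```python
import re

# CHAIN_RE and find_chains are hypothetical names used for illustration only.
CHAIN_RE = re.compile(
    r"""
    \b[A-Z][a-z]+          # first word: one uppercase, then 1+ lowercase ASCII letters
    (?=([-\s]?))           # peek at the next char; group 1 holds "-", a whitespace char, or ""
    (?:\1[A-Z][a-z]+)*     # further words, each preceded by the same text as group 1
    \b(?!-[A-Z])           # stop at a word boundary not followed by "-" plus an uppercase letter
    """,
    re.VERBOSE,
)

def find_chains(text):
    """Return every chain in order of appearance (duplicates kept)."""
    return [m.group() for m in CHAIN_RE.finditer(text)]

# A chain never mixes separators: the space after "First" fixes group 1 to " ",
# so "Stage-Two" cannot be appended to "First" and starts a chain of its own.
print(find_chains("First Stage-Two and Step-One"))
# ['First', 'Stage-Two', 'Step-One']
```

Because group 1 is set once per match attempt, every separator inside a chain has to be identical to the first one.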

See the Python demo:

import re
pattern = r"\b[A-Z][a-z]+(?=([-\s]?))(?:\1[A-Z][a-z]+)*\b(?!-[A-Z])"
text = "For the First Stage, you should press Start and you should follow Step-One and Step-Three. For the Final Stage, you must follow the sequence of steps One-Two-Five Seven // Steps One-Two-Five-Seven"
print( list(set([x.group() for x in re.finditer(pattern, text)])) )
# => ['Step-Three', 'For', 'First Stage', 'Seven', 'One-Two-Five-Seven', 'Start', 'One-Two-Five', 'Steps', 'Step-One', 'Final Stage']
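
Note that the demo uses re.finditer with x.group() rather than re.findall. A likely reason the attempts in the question "did not work" is that re.findall returns the contents of the capturing group, not the whole match, whenever the pattern contains one. The following sketch illustrates this on a shortened sample string (sample is a made-up input, not the text from the question) and also shows dict.fromkeys as one possible way to drop duplicates while keeping the order of appearance, which set() does not guarantee.

```python
import re

# Hypothetical sample text, shortened for illustration.
sample = "Press Start, then follow Step-One and Step-Three. Press Start again."

# With a capturing group, findall yields the group's last capture
# (or "" when the group did not take part in the match).
captured = re.findall(r"([A-Z][a-z]+-)*[A-Z][a-z]+", sample)
print(captured)  # ['', '', 'Step-', 'Step-', '', '']

# Making the group non-capturing returns the whole match instead.
whole = re.findall(r"(?:[A-Z][a-z]+-)*[A-Z][a-z]+", sample)
print(whole)  # ['Press', 'Start', 'Step-One', 'Step-Three', 'Press', 'Start']

# Order-preserving de-duplication (dict keys keep insertion order).
unique_in_order = list(dict.fromkeys(whole))
print(unique_in_order)  # ['Press', 'Start', 'Step-One', 'Step-Three']
```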

python

regex

python-re