Regex for finding chains of >=1 words starting with capital letters and connected with "-" or " "

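I would like to get all letter-only "chains" of at least 1 word, where each word starts with a capital letter followed by lowercase letters and the words are connected with a space (" ") or "-" (a single "chain" cannot be connected with both "-" and " ").

For example, for the following text: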

For the First Stage, you should press Start and you should follow Step-One and Step-Three. For the Final Stage, you must follow the sequence of One-Two-Five-Seven Steps

my output should be:

["For", "First Stage", "Start", "Step-One", "Step-Three", "Final Stage", "One-Two-Five-Seven", "Steps"]

So far I have tried writing 2 different regexes to solve my problem; the first should return the "chains" connected with "-", and the second should return the "chains" connected with " ":

import re
list(set(re.findall('([A-Z][a-z]+-)*[A-Z][a-z]+', mystring) + re.findall('([A-Z][a-z]+ )*[A-Z][a-z]+', mystring)))

However, I guess something is wrong with them, because neither of them works properly.

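You can use

\b[A-Z][a-z]+(?=([-\s]?))(?:\1[A-Z][a-z]+)*\b(?!-[A-Z])

See the regex demo. Details:

  • \b - a word boundary
  • [A-Z][a-z]+ - an uppercase ASCII letter followed by one or more lowercase ASCII letters
  • (?=([-\s]?)) - a positive lookahead that requires an optional - or whitespace character (one or zero occurrences) immediately to the right of the current location, capturing that character into Group 1
  • (?:\1[A-Z][a-z]+)* - zero or more repetitions of
    • \1 - the same text as captured in Group 1
    • [A-Z][a-z]+ - an uppercase ASCII letter followed by one or more lowercase ASCII letters
  • \b(?!-[A-Z]) - a word boundary that is not followed by - and an uppercase ASCII letter.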
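To make the role of Group 1 and the \1 backreference concrete, here is a minimal sketch. The helper name find_chains and the short sample strings are made up for illustration; only the pattern comes from the answer.

```python
import re

# Pattern from the answer: Group 1 captures the separator seen right after the
# first word (a hyphen, a whitespace character, or nothing), and \1 requires
# that same separator before every following word in the chain.
PATTERN = re.compile(r"\b[A-Z][a-z]+(?=([-\s]?))(?:\1[A-Z][a-z]+)*\b(?!-[A-Z])")

def find_chains(text):
    """Return the whole matches in order of appearance (hypothetical helper)."""
    return [m.group() for m in PATTERN.finditer(text)]

# The separator is fixed per chain, so a hyphen chain does not absorb a
# space-separated word.
print(find_chains("One-Two Three"))   # ['One-Two', 'Three']

# (?!-[A-Z]) makes "Steps One" fail, so "One" stays with its hyphen chain.
print(find_chains("Steps One-Two"))   # ['Steps', 'One-Two']

print(find_chains("First Stage"))     # ['First Stage']
```

The helper uses finditer with m.group() because the pattern contains a capturing group: re.findall would return the Group 1 text (the separator) instead of the whole match.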

Python demo:

import re
pattern = r"\b[A-Z][a-z]+(?=([-\s]?))(?:\1[A-Z][a-z]+)*\b(?!-[A-Z])"
text = "For the First Stage, you should press Start and you should follow Step-One and Step-Three. For the Final Stage, you must follow the sequence of steps One-Two-Five Seven // Steps One-Two-Five-Seven"
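# finditer + group() returns whole matches; the set removes duplicates such as the second "For"
print(list(set(x.group() for x in re.finditer(pattern, text))))
# => ['Step-Three', 'For', 'First Stage', 'Seven', 'One-Two-Five-Seven', 'Start', 'One-Two-Five', 'Steps', 'Step-One', 'Final Stage'] (set order may vary)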
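Note that a set does not keep the matches in order, while the expected output in the question lists the chains in order of first appearance. If that order matters, one option (a sketch, not part of the original answer) is to deduplicate with dict.fromkeys, shown here on the text from the question:

```python
import re

pattern = r"\b[A-Z][a-z]+(?=([-\s]?))(?:\1[A-Z][a-z]+)*\b(?!-[A-Z])"
# Sample text taken from the question.
text = ("For the First Stage, you should press Start and you should follow "
        "Step-One and Step-Three. For the Final Stage, you must follow the "
        "sequence of One-Two-Five-Seven Steps")

# Collect whole matches in order of appearance.
matches = [m.group() for m in re.finditer(pattern, text)]

# dict keys keep insertion order (Python 3.7+), so this removes the duplicate
# "For" while preserving the order of first occurrence.
unique_in_order = list(dict.fromkeys(matches))
print(unique_in_order)
# => ['For', 'First Stage', 'Start', 'Step-One', 'Step-Three', 'Final Stage', 'One-Two-Five-Seven', 'Steps']
```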