Regex for finding chains of >=1 words starting with capital letters and connected with "-" or " "
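I would like to get all letter-only "chains" of at least one word, where each word starts with a capital letter followed by lowercase letters and the words are connected either with a space (" ") or with "-" (a single chain cannot mix "-" and " ").
For example, for the following text: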
For the First Stage, you should press Start and you should follow
Step-One and Step-Three. For the Final Stage, you must follow the
sequence of One-Two-Five-Seven Steps
my output should be
["For", "First Stage", "Start", "Step-One", "Step-Three", "Final Stage", "One-Two-Five-Seven", "Steps"]
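So far, I have tried writing two different regexes to solve my problem; the first one should return the "chains" connected with "-", and the second one should return the "chains" connected with " ":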
import re
list(set(re.findall('([A-Z][a-z]+-)*[A-Z][a-z]+', mystring) + re.findall('([A-Z][a-z]+ )*[A-Z][a-z]+', mystring)))
However, I guess there is something wrong with them, because neither works correctly.
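You can use
\b[A-Z][a-z]+(?=([-\s]?))(?:\1[A-Z][a-z]+)*\b(?!-[A-Z])
See the regex demo. Details:
\b - a word boundary
[A-Z][a-z]+ - an uppercase ASCII letter followed by one or more lowercase ASCII letters
(?=([-\s]?)) - a positive lookahead that requires an optional (one or zero occurrences) - or whitespace character immediately to the right of the current position, capturing that character into Group 1
(?:\1[A-Z][a-z]+)* - zero or more repetitions of
    \1 - the same text as captured in Group 1
    [A-Z][a-z]+ - an uppercase ASCII letter followed by one or more lowercase ASCII letters
\b(?!-[A-Z]) - a word boundary that is not followed by - and an uppercase ASCII letter.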
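To make the role of the backreference and the trailing negative lookahead more concrete, here is a minimal sketch that runs the pattern on two short strings that mix separators. The helper name `chains` and the two test strings are made up for illustration; the pattern is written with the `\1` backreference that the details list describes.

```python
import re

# Pattern from the answer, with the \1 backreference inside the repeated group.
PATTERN = re.compile(r"\b[A-Z][a-z]+(?=([-\s]?))(?:\1[A-Z][a-z]+)*\b(?!-[A-Z])")

def chains(text):
    """Return the full matches in order of appearance (hypothetical helper)."""
    return [m.group() for m in PATTERN.finditer(text)]

# Hyphen first: Group 1 holds "-", so the space before "Three" ends the chain.
hyphen_then_space = chains("One-Two Three")
print(hyphen_then_space)   # ['One-Two', 'Three']

# Space first: "One Two" is rejected because it would be followed by "-Three",
# so the engine backtracks to "One" and then matches "Two-Three" separately.
space_then_hyphen = chains("One Two-Three")
print(space_then_hyphen)   # ['One', 'Two-Three']
```

In both cases the separator seen after the first word fixes the separator for the rest of the chain, which is what keeps "-" and " " from being mixed.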
import re
# \1 repeats whichever separator (hyphen, whitespace, or nothing) the lookahead captured
pattern = r"\b[A-Z][a-z]+(?=([-\s]?))(?:\1[A-Z][a-z]+)*\b(?!-[A-Z])"
text = "For the First Stage, you should press Start and you should follow Step-One and Step-Three. For the Final Stage, you must follow the sequence of steps One-Two-Five Seven // Steps One-Two-Five-Seven"
# finditer + group() yields whole matches; findall would return only Group 1
print(list(set(x.group() for x in re.finditer(pattern, text))))
# => ['Step-Three', 'For', 'First Stage', 'Seven', 'One-Two-Five-Seven', 'Start', 'One-Two-Five', 'Steps', 'Step-One', 'Final Stage'] (set order may vary)
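If the order of first appearance matters, as in the expected output from the question, one possible variation is to deduplicate with `dict.fromkeys` instead of `set`. The sketch below uses the question's original sample text and the pattern with the `\1` backreference; the variable names `separators` and `ordered` are illustrative only. It also shows why `re.findall` is avoided here: when a pattern contains a capturing group, `findall` returns the group text instead of the whole match, which is also a likely reason the attempt in the question misbehaves.

```python
import re

pattern = r"\b[A-Z][a-z]+(?=([-\s]?))(?:\1[A-Z][a-z]+)*\b(?!-[A-Z])"
text = ("For the First Stage, you should press Start and you should follow "
        "Step-One and Step-Three. For the Final Stage, you must follow the "
        "sequence of One-Two-Five-Seven Steps")

# With a capturing group present, findall returns Group 1 (the separator,
# possibly an empty string), not the matched chain.
separators = re.findall(pattern, text)

# finditer gives match objects, so group() is the whole chain.
# dict.fromkeys removes duplicates while keeping first-seen order.
ordered = list(dict.fromkeys(m.group() for m in re.finditer(pattern, text)))
print(ordered)
# ['For', 'First Stage', 'Start', 'Step-One', 'Step-Three', 'Final Stage', 'One-Two-Five-Seven', 'Steps']
```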