Split string by position not character

Question

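We know that anchors, word boundaries, and lookarounds match a position rather than a character.
Is it possible to split a string on one of these with a regular expression (specifically in Python)?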

For example, consider the following string:

"ThisisAtestForchEck,Match IngwithPosition."

I want the following result (substrings that begin at an uppercase letter that is not preceded by a space):

['Thisis', 'Atest', 'Forch', 'Eck,', 'Match Ingwith', 'Position.']

If I split with a capturing group I get:

>>> re.split(r'([A-Z])',s)
['', 'T', 'hisis', 'A', 'test', 'F', 'orch', 'E', 'ck,', 'M', 'atch ', 'I', 'ngwith', 'P', 'osition.']

And this is the result with lookarounds:

>>> re.split(r'(?<=[A-Z])',s)
['ThisisAtestForchEck,Match IngwithPosition.']
>>> re.split(r'((?<=[A-Z]))',s)
['ThisisAtestForchEck,Match IngwithPosition.']
>>> re.split(r'((?<=[A-Z])?)',s)
['ThisisAtestForchEck,Match IngwithPosition.']

Note that if I wanted to split at every uppercase letter, including those preceded by a space, e.g.:

['Thisis', 'Atest', 'Forch', 'Eck,', 'Match ', 'Ingwith', 'Position.']

I could use re.findall, i.e.:

>>> re.findall(r'([A-Z][^A-Z]*)',s)
['Thisis', 'Atest', 'Forch', 'Eck,', 'Match ', 'Ingwith', 'Position.']

But what about the first example: can it be solved with re.findall?

Answer 1

 (?<!\s)(?=[A-Z])

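You can split on this with the regex module, because re does not support splitting on zero-width assertions (this holds for Python versions before 3.7).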

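import regex
x = "ThisisAtestForchEck,Match IngwithPosition."
# VERSION1 enables splitting on zero-width matches in the regex module
print(regex.split(r"(?<![\s])(?=[A-Z])", x, flags=regex.VERSION1))

or, to drop any empty strings from the result:

print([i for i in regex.split(r"(?<![\s])(?=[A-Z])", x, flags=regex.VERSION1) if i])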

See the demo.

https://regex101.com/r/sJ9gM7/65
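
For readers on a newer interpreter, here is a minimal sketch of the same split using only the standard library. It assumes Python 3.7 or later, where re.split accepts patterns that match an empty string; on older versions it will not behave this way. The variable names are illustrative.

```python
import re

s = "ThisisAtestForchEck,Match IngwithPosition."

# Split at every position that is followed by an uppercase letter
# and not preceded by whitespace (assumes Python 3.7+).
pieces = re.split(r"(?<!\s)(?=[A-Z])", s)

# The pattern also matches at position 0, which yields a leading
# empty string, so drop empty pieces.
pieces = [p for p in pieces if p]
print(pieces)
```

The filter is needed because the string itself starts with an uppercase letter, so the very first position qualifies as a split point.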

Answer 2

Try capturing with this pattern:

([A-Z][a-z]*(?: [A-Z][a-z]*)*)
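
As a rough check, the pattern can be run with re.findall on the string from the question. The sketch below is not from the original answer, and the variable names are illustrative.

```python
import re

s = "ThisisAtestForchEck,Match IngwithPosition."

# Each match: an uppercase letter plus lowercase letters, optionally
# followed by more such words separated by a single space.
pattern = r"([A-Z][a-z]*(?: [A-Z][a-z]*)*)"
words = re.findall(pattern, s)
print(words)
# ['Thisis', 'Atest', 'Forch', 'Eck', 'Match Ingwith', 'Position']
```

Because the character class is [a-z], punctuation is not captured: the comma after 'Eck' and the final period are dropped, so the result differs slightly from the output requested in the question.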


Answer 3

I know this may be less convenient because the results come back as tuples, but I think this findall finds what you need:

re.findall(r'((?<!\s)[A-Z]([^A-Z]|(?<=\s)[A-Z])*)', s)
## returns [('Thisis', 's'), ('Atest', 't'), ('Forch', 'h'), ('Eck,', ','), ('Match Ingwith', 'h'), ('Position.', '.')]

This can be used in the following list comprehension to give the desired output:

[val[0] for val in re.findall(r'((?<!\s)[A-Z]([^A-Z]|(?<=\s)[A-Z])*)', s)]
## returns ['Thisis', 'Atest', 'Forch', 'Eck,', 'Match Ingwith', 'Position.']

And here is a hack that uses split:

re.split(r'((?<!\s)[A-Z]([^A-Z]|(?<=\s)[A-Z])*)', s)[1::3]
## returns ['Thisis', 'Atest', 'Forch', 'Eck,', 'Match Ingwith', 'Position.']
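
The tuples appear because the pattern contains capturing groups. One possible variation, not part of the original answer, is to drop the outer group and make the inner one non-capturing, so that findall returns plain strings. A minimal sketch:

```python
import re

s = "ThisisAtestForchEck,Match IngwithPosition."

# Same logic as above, but with no capturing groups:
# (?:...) groups without capturing, so findall returns whole matches.
pattern = r"(?<!\s)[A-Z](?:[^A-Z]|(?<=\s)[A-Z])*"
matches = re.findall(pattern, s)
print(matches)
# ['Thisis', 'Atest', 'Forch', 'Eck,', 'Match Ingwith', 'Position.']
```

With this form neither the list comprehension nor the [1::3] slice is needed.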

Answer 4

A re.findall approach:

re.findall(r'(?:[A-Z]|^[^A-Z\s])[^A-Z\s]*(?:\s+[A-Z][^A-Z]*)*',s)

When you decide to switch from split to findall, the first job is to reformulate your requirement: "I want to split the string on each uppercase letter not preceded by a space" => "I want to find one or more space-separated substrings that begin with an uppercase letter, except at the start of the string (if the string doesn't start with an uppercase letter)"
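
To illustrate the reformulated requirement, here is a small sketch that applies the pattern both to the string from the question and to a second input that starts with a lowercase letter. The second input, "fooBar Baz", is a made-up example that does not come from the original thread.

```python
import re

pattern = r"(?:[A-Z]|^[^A-Z\s])[^A-Z\s]*(?:\s+[A-Z][^A-Z]*)*"

s = "ThisisAtestForchEck,Match IngwithPosition."
result = re.findall(pattern, s)
print(result)
# ['Thisis', 'Atest', 'Forch', 'Eck,', 'Match Ingwith', 'Position.']

# Hypothetical input that starts with a lowercase letter: the
# ^[^A-Z\s] alternative lets the first chunk be matched anyway.
lower_start = re.findall(pattern, "fooBar Baz")
print(lower_start)
# ['foo', 'Bar Baz']
```

The ^[^A-Z\s] branch is what covers the "except at the start of the string" clause. Without it, a leading lowercase chunk such as 'foo' would be skipped.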
