Using regex with a positive look-behind to split strings in Python
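To address one of the comments: my overall goal is to understand how to write a regex that lets me use a word boundary inside a positive or negative look-behind, since it seems quantifiers are not allowed there.
For my specific case, I want to check whether the word before a period ('.') is capitalized. I can think of two ways to approach this:
1) A positive look-behind asserting that the word before the '.' is all lowercase. However, I get an error because the look-behind is a zero-width assertion that must have a fixed width, so I can't use the '+' quantifier like this: (?<=[^A-Z][a-z]+)
2) A negative look-behind asserting that the word before the '.' does not start with an uppercase letter, e.g.: (?<![A-Z][a-z])
I'd prefer some tweak of option 1 because it makes more sense to me, but I'm open to other suggestions. Can I use a word boundary here?
I'm using this to ultimately split a paragraph into its sentences, and I'd like to stick with regex rather than nltk. The main problem is handling initials and abbreviated first names.
Current regex: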
(?<=[^A-Z][a-z])\.(?=\s[A-Z]+)
Input:
Koehler rides the bus. Bowman was passed into the first grade; Koehler advanced to third grade. Jon. Williams walked down the road to school. Bowman decided to go fishing; Koehler did not. C. Robinson asked to go to recess, and the teacher said no.
Desired output:
Koehler rides the bus.
Bowman was passed into the first grade; Koehler advanced to third grade.
Jon. Williams walked down the road to school.
Bowman decided to go fishing; Koehler did not.
C. Robinson asked to go to recess, and the teacher said no.
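For your specific case, I'd recommend re.sub. Your regex becomes a lot simpler this way, and you don't need a look-behind, which comes with many restrictions (it must be fixed width, and so on).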
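To make the fixed-width restriction concrete, here is a minimal sketch using only the standard `re` module. The three-letter pattern and the sample string are made up for illustration; the point is that a word boundary is allowed inside a look-behind as long as everything in it has a fixed width.

```python
import re

# A look-behind containing '+' has no fixed width, so `re` refuses to compile it.
try:
    re.compile(r'(?<=[^A-Z][a-z]+)\.')
    variable_width_ok = True
except re.error as exc:
    variable_width_ok = False
    print("rejected:", exc)

# \b is itself zero-width, so it may appear inside a look-behind,
# but the rest of the look-behind still needs a fixed length ({3} here).
fixed = re.compile(r'(?<=\b[a-z]{3})\.')

# Illustrative sample: only the period after the lowercase word "bus" qualifies.
sample = "the bus. Jon. C."
positions = [m.start() for m in fixed.finditer(sample)]
print(positions)
```

This only handles words of exactly three letters, which is why the approaches in this answer avoid the look-behind altogether.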
Code
print(re.sub(r'(\b[a-z]+\.\s*(?!$))', r'\g<1>\n', text, flags=re.M))
Output
Koehler rides the bus.
Bowman was passed into the first grade; Koehler advanced to third grade.
Jon. Williams walked down the road to school.
Bowman decided to go fishing; Koehler did not.
C. Robinson asked to go to recess, and the teacher said no.
Regex details
( # first capture group
\b # word boundary
[a-z]+ # lower case a-z
\. # literal period
\s* # any other whitespace characters (added for cosmetic effect)
(?!$) # negative lookahead - don't insert a newline when you're at the end of a sentence
)
The pattern is replaced with:
\g<1> # reference to the first capture group
\n # a newline
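Putting the pieces above together, a self-contained sketch might look like the following. It is a slight variation rather than the exact code above: the `\s*` is moved outside the capture group (an assumption on my part) so the whitespace after each period is dropped instead of being kept before the newline.

```python
import re

text = (
    "Koehler rides the bus. Bowman was passed into the first grade; "
    "Koehler advanced to third grade. Jon. Williams walked down the road to school. "
    "Bowman decided to go fishing; Koehler did not. "
    "C. Robinson asked to go to recess, and the teacher said no."
)

# Group 1 is the lowercase word plus its period; the trailing whitespace is
# matched but not captured, and (?!$) skips the final sentence of the text.
pattern = re.compile(r'(\b[a-z]+\.)\s*(?!$)')

# Re-insert group 1 and follow it with a newline.
result = pattern.sub(r'\g<1>\n', text)
sentences = result.split('\n')

for sentence in sentences:
    print(sentence)
```

Note that `flags` is passed by keyword when it is needed at all; the fourth positional argument of `re.sub` is `count`, not `flags`.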
If you want to build a list of sentences, here is another option:
# Split into sentences (last word is split off too)
temp = re.split(r'( [a-z]+\.)', text)
# Drop empty strings; list() is needed in Python 3 so temp can be indexed below
temp = list(filter(bool, temp))
['Koehler rides the', ' bus.', ' Bowman was passed into the first grade; Koehler advanced to third', ' grade.', ' Jon. Williams walked down the road to', ' school.', ' Bowman decided to go fishing; Koehler did', ' not.', ' C. Robinson asked to go to recess, and the teacher said', ' no.']
# Join the pieces back together
sentences = [''.join([temp[i], temp[i + 1]]).strip() for i in range(0, len(temp), 2)]
['Koehler rides the bus.', 'Bowman was passed into the first grade; Koehler advanced to third grade.', 'Jon. Williams walked down the road to school.', 'Bowman decided to go fishing; Koehler did not.', 'C. Robinson asked to go to recess, and the teacher said no.']
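The split-and-rejoin steps above can also be wrapped in a small helper. This is a sketch, and `split_sentences` is a hypothetical name. Instead of filtering out empty strings, it relies on the fact that `re.split` with one capture group returns the unmatched pieces at even indices and the captured terminators at odd indices.

```python
import re

def split_sentences(text):
    """Split on ' <lowercase word>.' and glue each terminator back on."""
    parts = re.split(r'( [a-z]+\.)', text)
    bodies = parts[0::2]   # text between terminators (always one more than enders)
    enders = parts[1::2]   # captured ' word.' terminators
    sentences = [(body + end).strip() for body, end in zip(bodies, enders)]
    tail = parts[-1].strip()  # anything left after the last terminator
    if tail:
        sentences.append(tail)
    return sentences

text = (
    "Koehler rides the bus. Bowman was passed into the first grade; "
    "Koehler advanced to third grade. Jon. Williams walked down the road to school. "
    "Bowman decided to go fishing; Koehler did not. "
    "C. Robinson asked to go to recess, and the teacher said no."
)
sentences = split_sentences(text)
print(sentences)
```

Pairing by index parity keeps the pieces aligned even when one of them is empty, and any trailing text that does not end in a lowercase word plus period is kept as a final item rather than lost.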
Try:
mystr="Koehler rides the bus. Bowman was passed into the first grade; Koehler advanced to third grade. Jon. Williams walked down the road to school. Bowman decided to go fishing; Koehler did not. C. Robinson asked to go to recess, and the teacher said no."
lst=re.findall(r'.+?\b(?![A-Z])\w+\.',mystr)
If the input spans multiple lines, use this instead:
lst=re.findall(r'.+?(?:$|\b(?![A-Z])\w+\b\.)',mystr,re.M)
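Both produce output like the following (the break after 'advanced' presumably reflects a line break in the multi-line input):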
['Koehler rides the bus.', ' Bowman was passed into the first grade; Koehler advanced', 'to third grade.', ' Jon. Williams walked down the road to school.', ' Bowman decided to go fishing; Koehler did not.', ' C. Robinson asked to go to recess, and the teacher said no.']
Explanation of '.+?\b(?![A-Z])\w+\.':
.+? #as few characters as possible after the end of the previous match, so that we get as many distinct sentences as possible
\b #word boundary
(?![A-Z]) #negative lookahead => don't follow \b with [A-Z] => skip capitalized words
\w+ #the whole word
\. #followed by a dot
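As a quick check of the breakdown above, the following sketch runs the single-line pattern on the sample paragraph and strips the leading space from each match. The `.strip()` step is my addition; everything else is the pattern as given.

```python
import re

mystr = (
    "Koehler rides the bus. Bowman was passed into the first grade; "
    "Koehler advanced to third grade. Jon. Williams walked down the road to school. "
    "Bowman decided to go fishing; Koehler did not. "
    "C. Robinson asked to go to recess, and the teacher said no."
)

# The lookahead placed right after \b plays the role the look-behind was meant
# to play: it rejects a final word that starts with an uppercase letter.
pattern = re.compile(r'.+?\b(?![A-Z])\w+\.')

# Each match after the first begins with the space left over from the previous one.
sentences = [match.strip() for match in pattern.findall(mystr)]
print(sentences)
```

One consequence of this design is that a sentence whose last word is capitalized (for example one ending in "Koehler.") is not treated as a sentence end and gets merged with the text that follows.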