Using regex with a positive lookbehind to split strings in Python

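To address one of the comments: my overall goal is to understand how to write a regex that lets me use a word boundary inside a positive or negative lookbehind, since it seems you can't use quantifiers there.

For my specific case, I want to check whether the word before a period ('.') is a capitalized word. I can think of two different ways to approach this:

1) A positive lookbehind asserting that the word before the '.' is all lowercase. However, I get an error that the positive lookbehind is zero-width, so I can't use a quantifier such as '+' like this: (?<=[^A-Z][a-z]+)

2) A negative lookbehind asserting that the word before the '.' does not start with an uppercase letter, e.g.: (?<![A-Z][a-z])

I would prefer some tweak of option 1, since it makes more sense to me, but I'm open to other suggestions. Can I use a word boundary here?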
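As a rough illustration of the restriction described above, the sketch below checks which lookbehind patterns Python's re module accepts. The helper name `compiles` and the three-letter pattern are made up for this example; they are not part of the original question.

```python
import re

def compiles(pattern):
    """Return True if Python's re module accepts `pattern` (hypothetical helper)."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Option 1 from the question: a quantifier inside the lookbehind.
variable_width = r'(?<=[^A-Z][a-z]+)\.'
# A word boundary is zero-width, so it is allowed inside a lookbehind
# as long as everything else has a fixed length (here: exactly 3 letters).
fixed_width = r'(?<=\b[a-z]{3})\.'

print(compiles(variable_width))  # False: look-behind requires fixed-width pattern
print(compiles(fixed_width))     # True

# The fixed-width form only covers words of exactly that length:
print(bool(re.search(fixed_width, "the bus.")))     # True
print(bool(re.search(fixed_width, "third grade.")))  # False, "grade" has 5 letters
print(bool(re.search(fixed_width, "Jon.")))          # False, starts with uppercase
```

So a word boundary by itself is not the obstacle; the variable length of the word is, which is why the answers below avoid the lookbehind altogether.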

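I am using this to ultimately split a paragraph into its sentences, and I would like to stick with regex rather than use nltk. The problem is mainly in handling initials and abbreviated first names.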

Current regex:

(?<=[^A-Z][a-z])\.(?=\s[A-Z]+)

Input:

Koehler rides the bus. Bowman was passed into the first grade; Koehler advanced to third grade. Jon. Williams walked down the road to school. Bowman decided to go fishing; Koehler did not. C. Robinson asked to go to recess, and the teacher said no.

Desired output:

Koehler rides the bus.
Bowman was passed into the first grade; Koehler advanced to third grade.
Jon. Williams walked down the road to school.
Bowman decided to go fishing; Koehler did not.
C. Robinson asked to go to recess, and the teacher said no.
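To make the problem concrete, here is a minimal sketch that runs the current regex from the question through re.split on the input. The variable names `text`, `current` and `pieces` are chosen for this example only.

```python
import re

text = (
    "Koehler rides the bus. Bowman was passed into the first grade; "
    "Koehler advanced to third grade. Jon. Williams walked down the road "
    "to school. Bowman decided to go fishing; Koehler did not. "
    "C. Robinson asked to go to recess, and the teacher said no."
)

# The two-character lookbehind only inspects the last two characters
# before the period, so "on" in "Jon." passes the check.
current = r'(?<=[^A-Z][a-z])\.(?=\s[A-Z]+)'

pieces = re.split(current, text)
for piece in pieces:
    print(repr(piece.strip()))
```

The split yields six pieces instead of five, because "Jon" and "Williams walked down the road to school" end up separated. The initial "C." is left alone, since "C" is not a lowercase letter. Note also that re.split consumes the period it matches on.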

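For your specific case, I would recommend re.sub. Your regex becomes a lot simpler this way, and you don't need lookbehinds, which come with a lot of restrictions (they have to be fixed width, among other things).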

Code

import re
print(re.sub(r'(\b[a-z]+\.\s*(?!$))', r'\1\n', text, flags=re.M))

Output

Koehler rides the bus. 
Bowman was passed into the first grade; Koehler advanced to third grade. 
Jon. Williams walked down the road to school. 
Bowman decided to go fishing; Koehler did not. 
C. Robinson asked to go to recess, and the teacher said no.

Regex details

(         # first capture group
\b        # word boundary
[a-z]+    # lower case a-z
\.        # literal period
\s*       # any other whitespace characters (added for cosmetic effect)
(?!$)     # negative lookahead - don't insert a newline when you're at the very end of the text
)

This pattern is replaced with:

\1        # reference to the first capture group
\n        # a newline
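The annotated breakdown above can also live in the code itself by compiling the pattern with re.VERBOSE. This is only a sketch of the same substitution; the names `pattern`, `result` and `lines` are illustrative, and the trailing whitespace left by `\s*` is stripped afterwards.

```python
import re

text = (
    "Koehler rides the bus. Bowman was passed into the first grade; "
    "Koehler advanced to third grade. Jon. Williams walked down the road "
    "to school. Bowman decided to go fishing; Koehler did not. "
    "C. Robinson asked to go to recess, and the teacher said no."
)

pattern = re.compile(r"""
    (          # first capture group
      \b       # word boundary
      [a-z]+   # an all-lowercase word
      \.       # literal period
      \s*      # any whitespace after the period
      (?!$)    # but not at the very end of the text
    )
""", re.VERBOSE)

# \1 puts the matched word and period back; \n adds the line break.
result = pattern.sub(r"\1\n", text)
lines = [line.strip() for line in result.splitlines()]

for line in lines:
    print(line)
```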

If you want to create a list of sentences, here is another option:

# Split into sentences (last word is split off too)
temp = re.split(r'( [a-z]+\.)', text)
# Drop empty strings; list() is needed on Python 3, where filter() returns an iterator
temp = list(filter(bool, temp))

['Koehler rides the', ' bus.', ' Bowman was passed into the first grade; Koehler advanced to third', ' grade.', ' Jon. Williams walked down the road to', ' school.', ' Bowman decided to go fishing; Koehler did', ' not.', ' C. Robinson asked to go to recess, and the teacher said', ' no.']

# Join the pieces back together
sentences = [''.join([temp[i], temp[i + 1]]).strip() for i in range(0, len(temp), 2)]

['Koehler rides the bus.', 'Bowman was passed into the first grade; Koehler advanced to third grade.', 'Jon. Williams walked down the road to school.', 'Bowman decided to go fishing; Koehler did not.', 'C. Robinson asked to go to recess, and the teacher said no.']
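The split-and-join steps above can be wrapped in one function. The sketch below is one possible way to do that; `split_sentences` is a hypothetical helper name, and instead of filtering out empty strings it relies on re.split returning the pieces in alternating text/separator order when the pattern has a single capture group.

```python
import re

def split_sentences(text):
    """Split `text` after every all-lowercase word followed by a period.

    Hypothetical helper: with one capture group, re.split returns
    [text0, sep1, text1, sep2, ..., textN], so each even index is a
    sentence body and the following odd index is its final word.
    """
    parts = re.split(r'( [a-z]+\.)', text)
    sentences = []
    for i in range(0, len(parts), 2):
        ending = parts[i + 1] if i + 1 < len(parts) else ''
        sentence = (parts[i] + ending).strip()
        if sentence:  # skip the empty piece after the last period
            sentences.append(sentence)
    return sentences

text = (
    "Koehler rides the bus. Bowman was passed into the first grade; "
    "Koehler advanced to third grade. Jon. Williams walked down the road "
    "to school. Bowman decided to go fishing; Koehler did not. "
    "C. Robinson asked to go to recess, and the teacher said no."
)

for sentence in split_sentences(text):
    print(sentence)
```

Pairing by index this way also keeps any trailing text that has no final period as its own entry rather than raising an IndexError.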

Try

mystr="Koehler rides the bus. Bowman was passed into the first grade; Koehler advanced to third grade. Jon. Williams walked down the road to school. Bowman decided to go fishing; Koehler did not. C. Robinson asked to go to recess, and the teacher said no."
lst=re.findall(r'.+?\b(?![A-Z])\w+\.',mystr)

For multi-line input, use the following instead:

lst=re.findall(r'.+?(?:$|\b(?![A-Z])\w+\b\.)',mystr,re.M)

With the single-line mystr above, both produce:

['Koehler rides the bus.', ' Bowman was passed into the first grade; Koehler advanced to third grade.', ' Jon. Williams walked down the road to school.', ' Bowman decided to go fishing; Koehler did not.', ' C. Robinson asked to go to recess, and the teacher said no.']

Explanation of '.+?\b(?![A-Z])\w+\.':

.+?       # as few characters as possible after the end of the previous match, so that we get as many distinct sentences as possible
\b        #word boundary
(?![A-Z]) #negative lookahead => don't follow \b with [A-Z] => skip capitalized words
\w+       #the whole word
\.        #followed by a dot
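Put together as a runnable snippet, the findall approach might look like the sketch below. Stripping the leading space from each match is an addition for this example, and the names `pattern` and `sentences` are illustrative.

```python
import re

mystr = (
    "Koehler rides the bus. Bowman was passed into the first grade; "
    "Koehler advanced to third grade. Jon. Williams walked down the road "
    "to school. Bowman decided to go fishing; Koehler did not. "
    "C. Robinson asked to go to recess, and the teacher said no."
)

# Lazily consume text until a word that does not start with A-Z
# is immediately followed by a period.
pattern = re.compile(r'.+?\b(?![A-Z])\w+\.')

# Each match after the first starts with the space that followed the
# previous period, so strip it off.
sentences = [match.strip() for match in pattern.findall(mystr)]

for sentence in sentences:
    print(sentence)
```

One thing to keep in mind with this approach: findall only returns what the pattern matches, so any trailing text that does not end in a non-capitalized word plus a period is silently left out of the list.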
