Fixing sentences: add space after punctuation but not after decimal points or abbreviations

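I work with very messy text in which sentences are not capitalized and punctuation is not spaced correctly. I need to add a space after the punctuation characters [.,:;)!?] wherever it is missing, but not inside decimal numbers or abbreviations.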

Here is an example:

mystring = 'this is my first sentence with (brackets)in it. this is the second?What about this sentence with D.D.T. in it?or this with 4.5?'

This is what I have so far:

import re

def fix_punctuation(text):
    def sentence_case(text):
        # Split into sentences: find all text that ends with
        # punctuation followed by whitespace or end of string.
        sentences = re.findall(r'[^.!?]+[.!?](?:\s|\Z)', text)

        # Capitalize the first letter of each sentence
        sentences = [x[0].upper() + x[1:] for x in sentences]

        # Combine sentences
        return ''.join(sentences)

    # Add a space after every punctuation character
    # (\1 re-inserts the matched character before the space)
    text = re.sub(r'([.,;:!?)])', r'\1 ', text)
    # Capitalize sentences
    text = sentence_case(text)

    return text

This gives me the following output:

'This is my first sentence with (brackets) in it.  this is the second? What about this sentence with D. D. T.  in it? Or this with 4. 5? '

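I tried the approaches suggested in a couple of other answers, but they did not work for my case. Regex makes my brain hurt, so I would really appreciate your help.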

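You can use a lookahead to check that the character after the dot is neither a digit nor a character that is itself followed by another dot (an abbreviation). Apply this check to the dot only, and treat the other punctuation characters separately. You also should not insert a space inside a run such as !?, so those characters get a space only when a word character follows:

text = re.sub(r"(\.)(?=[^\d\s.][^.])|([,;:!?)])(?=\w)", r"\1\2 ", text)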
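To see what the lookahead approach does on the sample string from the question, here is a minimal sketch. The pattern is the one shown above; the helper name `add_missing_spaces` and the use of a replacement function (which re-emits whichever group matched) are assumptions made for illustration, not part of the original answer.

```python
import re

# Pattern from the answer above: a dot that starts neither a decimal
# nor an abbreviation, or other punctuation followed by a word character.
PATTERN = re.compile(r"(\.)(?=[^\d\s.][^.])|([,;:!?)])(?=\w)")

def add_missing_spaces(text):
    # Hypothetical helper: re-emit whichever group matched, then one space.
    return PATTERN.sub(lambda m: (m.group(1) or m.group(2)) + " ", text)

mystring = ('this is my first sentence with (brackets)in it. this is the '
            'second?What about this sentence with D.D.T. in it?or this with 4.5?')

print(add_missing_spaces(mystring))
# this is my first sentence with (brackets) in it. this is the second? What about this sentence with D.D.T. in it? or this with 4.5?
```

This step only repairs spacing; capitalizing the sentences would still need a separate pass such as `sentence_case`.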

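The more scenarios you want to cover, the more complex the pattern becomes.

As I understand it, you want to ignore periods inside numbers and inside period-separated blocks of single letters, optionally followed by a trailing period.

Here is a code snippet that implements the logic described above: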

import re

mystring = 'this is my first sentence with (brackets)in it. this is the second?What about this sentence with D.D.T. in it?or this with 4.5?'

def fix_punctuation(text):
    def sentence_case(text):
        # Split into sentences. Therefore, find all text that ends
        # with punctuation followed by white space or end of string.
        sentences = re.findall(r'(?:\d+\.\d+|\b[A-Z](?:\.[A-Z])*\b\.?|[^.!?])+[.!?](?:\s|\Z)', text)

        # Capitalize the first letter of each sentence
        sentences = [x[0].upper() + x[1:] for x in sentences]

        # Combine sentences
        return ''.join(sentences)
    
    #add space after punctuation
    text = re.sub(r'(\d+\.\d+|\b[A-Z](?:\.[A-Z])*\b\.?)|([.,;:!?)])\s*', lambda x: x.group(1) or f'{x.group(2)} ', text)
    #capitalize sentences
    return sentence_case(text)
    
print(fix_punctuation(mystring))
# => This is my first sentence with (brackets) in it. This is the second?
#    What about this sentence with D.D.T. in it? Or this with 4.5? 

See the Python demo.

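The re.findall pattern, (?:\d+\.\d+|\b[A-Z](?:\.[A-Z])*\b\.?|[^.!?])+[.!?](?:\s|\Z), matches:

  • (?:\d+\.\d+|\b[A-Z](?:\.[A-Z])*\b\.?|[^.!?])+ - one or more occurrences of
    • \d+\.\d+ - one or more digits, a ., one or more digits
    • | - or
    • \b[A-Z](?:\.[A-Z])*\b\.? - a word boundary, an uppercase letter, zero or more repetitions of a period followed by an uppercase letter, a word boundary, and an optional .
    • | - or
    • [^.!?] - any character other than ., ! or ?
  • [.!?] - a ., ! or ?
  • (?:\s|\Z) - a whitespace character or the end of the string.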
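The breakdown above can also be written into the pattern itself. The following is a minimal sketch that restates the same re.findall pattern with `re.VERBOSE` so each alternative carries its own comment; the name `SENTENCE` and the shortened sample text are assumptions chosen for illustration.

```python
import re

# Same pattern as above, spread over several lines with re.VERBOSE.
SENTENCE = re.compile(r"""
    (?:
        \d+\.\d+                    # a decimal number such as 4.5
      | \b[A-Z](?:\.[A-Z])*\b\.?    # an abbreviation such as D.D.T.
      | [^.!?]                      # any character that is not . ! or ?
    )+
    [.!?]                           # the sentence terminator
    (?:\s|\Z)                       # whitespace or end of string
""", re.VERBOSE)

text = 'what about D.D.T. in it? or this with 4.5? '
print(SENTENCE.findall(text))
# ['what about D.D.T. in it? ', 'or this with 4.5? ']
```

Because the decimal and abbreviation alternatives are tried before the single-character alternative, their periods are consumed as part of the sentence body rather than being treated as terminators.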

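The re.sub pattern, (\d+\.\d+|\b[A-Z](?:\.[A-Z])*\b\.?)|([.,;:!?)])\s*, first tries to match the patterns we want to skip and captures them into Group 1. Otherwise it matches one punctuation character, captures it into Group 2, and then matches zero or more whitespace characters (so that exactly one space follows the punctuation). The replacement argument is a lambda expression with custom logic: it returns Group 1 unchanged if it matched, and otherwise returns Group 2 followed by a single space.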
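The "capture what you want to keep, replace everything else" idea can be tried in isolation. This is a minimal sketch using the same re.sub pattern; the names `SKIP_OR_PUNCT` and `space_after_punct` and the sample strings are assumptions introduced here, and the named function simply spells out what the lambda does.

```python
import re

# Group 1: text to leave untouched (decimals, abbreviations).
# Group 2: a punctuation character; trailing whitespace is consumed.
SKIP_OR_PUNCT = re.compile(
    r'(\d+\.\d+|\b[A-Z](?:\.[A-Z])*\b\.?)|([.,;:!?)])\s*'
)

def space_after_punct(match):
    # Hypothetical replacement function equivalent to the lambda above.
    if match.group(1):
        return match.group(1)        # skipped pattern: return it unchanged
    return match.group(2) + ' '      # punctuation: normalize to one space

for sample in ('pi is 3.14,roughly', 'the U.S.A. team', 'wait!   what'):
    print(repr(SKIP_OR_PUNCT.sub(space_after_punct, sample)))
# 'pi is 3.14, roughly'
# 'the U.S.A. team'
# 'wait! what'
```

Putting the skip alternative first matters: the regex engine tries alternatives left to right, so a decimal or abbreviation is consumed whole before the punctuation alternative gets a chance to match one of its periods.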