Split by '.' when not preceded by digit

Question

I want to split '10.1 This is a sentence. Another sentence.' into ['10.1 This is a sentence', 'Another sentence'], and split '10.1. This is a sentence. Another sentence.' into ['10.1. This is a sentence', 'Another sentence'].

I have tried

s.split(r'\D.\D')

but it does not work. How can I fix this?

Answer 1

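You have several problems:

You are not using re.split(); you are using str.split(), which treats its argument as a literal string.
You have not escaped the ., so it matches any character; use \. instead.
You are not using lookaheads and lookbehinds, so the 3 matched characters are consumed by the split.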
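To make the points above concrete, here is a small sketch (the variable names are only illustrative) comparing the original str.split() call with an escaped re.split() that still has no lookarounds:

```python
import re

s = '10.1 This is a sentence. Another sentence.'

# str.split() looks for the literal text \D.\D, which never occurs in s,
# so the string comes back unsplit.
literal = s.split(r'\D.\D')

# re.split() with an escaped period does split, but the non-digit characters
# on both sides of the period are part of the match and are removed with it.
consumed = re.split(r'\D\.\D', s)

print(literal)   # ['10.1 This is a sentence. Another sentence.']
print(consumed)  # ['10.1 This is a sentenc', 'Another sentence.']
```

Note how the final "e" of "sentence" and the following space disappear in the second result, which is why zero-width lookarounds are needed.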

Fixed code:

>>> import re
>>> s = '10.1 This is a sentence. Another sentence.'
>>> re.split(r"(?<=\D\.)(?=\D)", s)
['10.1 This is a sentence.', ' Another sentence.']

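Basically, (?<=\D\.) finds a position immediately after a . that is itself preceded by a non-digit character. (?=\D) then ensures that a non-digit follows the current position. When both conditions hold, the string is split at that position. Because the pattern is zero-width, no characters are removed, so each piece keeps its period and the second piece keeps its leading space.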
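If the exact lists from the question are wanted, the pieces can be cleaned up after splitting. The following is a minimal sketch, assuming Python 3.7 or later (earlier versions of re.split() do not split on empty matches); the helper name split_sentences is hypothetical and not part of the answer:

```python
import re

def split_sentences(text):
    # Split at the zero-width position that follows "non-digit + period"
    # and is itself followed by a non-digit character.
    parts = re.split(r"(?<=\D\.)(?=\D)", text)
    cleaned = []
    for part in parts:
        part = part.strip()          # drop the leading space left by the split
        if part.endswith('.'):
            part = part[:-1]         # drop the single trailing period
        cleaned.append(part)
    return cleaned

print(split_sentences('10.1 This is a sentence. Another sentence.'))
print(split_sentences('10.1. This is a sentence. Another sentence.'))
```

Both inputs from the question yield the requested lists, because the period after "10.1" is preceded by a digit and so never satisfies the lookbehind.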

Answer 2

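If you plan to split a string on a . character that is neither preceded nor followed by a digit, and that is not at the end of the string, a splitting approach may work for you: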

re.split(r'(?<!\d)\.(?!\d|$)', text)

See the regex demo.
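As a quick check of the splitting pattern, this sketch applies it to both strings from the question. The compiled-pattern name PERIOD and the results list are only illustrative:

```python
import re

# A period with no digit on either side that is not the last character.
PERIOD = re.compile(r'(?<!\d)\.(?!\d|$)')

texts = (
    '10.1 This is a sentence. Another sentence.',
    '10.1. This is a sentence. Another sentence.',
)

results = [PERIOD.split(text) for text in texts]
for pieces in results:
    print(pieces)
# ['10.1 This is a sentence', ' Another sentence.']
# ['10.1. This is a sentence', ' Another sentence.']
```

The last piece keeps its leading space and final period, so a further strip step would be needed to match the lists in the question exactly.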

If your strings can contain more special cases, you can use a more customizable extraction approach:

re.findall(r'(?:\d+(?:\.\d+)*\.?|[^.])+', text)

See this regex demo. Details:

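(?:\d+(?:\.\d+)*\.?|[^.])+ - a non-capturing group matched one or more times:
- \d+(?:\.\d+)*\.? - one or more digits (\d+), then zero or more sequences of a . followed by one or more digits ((?:\.\d+)*), then an optional . character (\.?)
- | - or
- [^.] - any character other than a . character.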
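The extraction approach can be tried out as follows. This is a minimal sketch: the names TOKEN and extract are hypothetical, and the strip() call is an addition to remove the leading space that each later match retains:

```python
import re

# A run of "numbering" tokens (digits with inner periods and an optional
# trailing period) or any characters that are not a period.
TOKEN = re.compile(r'(?:\d+(?:\.\d+)*\.?|[^.])+')

def extract(text):
    # findall returns whole matches because the pattern has no capturing groups.
    return [match.strip() for match in TOKEN.findall(text)]

print(extract('10.1 This is a sentence. Another sentence.'))
print(extract('10.1. This is a sentence. Another sentence.'))
```

Because sentence-ending periods match neither alternative, they act as separators and never appear in the extracted pieces.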

Answer 3

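Every sentence (except the last one) ends with a period followed by a space, so split on that. Worrying about the clause numbers is approaching the problem backwards. You can find all kinds of cases that you do not want, but it is usually much easier to describe the case that you do want. Here, '. ' is that case.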

import re

doc = '10.1 This is a sentence. Another sentence.'

def sentences(doc):
    #split all sentences
    s = re.split(r'\.\s+', doc)

    #remove empty index or remove period from absolute last index, if present
    if s[-1] == '':
        s     = s[0:-1]
    elif s[-1].endswith('.'):
        s[-1] = s[-1][:-1]

    #return sentences
    return s

print(sentences(doc))

The way I built the regex should also eliminate arbitrary whitespace between paragraphs.
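The whitespace behaviour can be checked with a short sketch. It repeats the function in compact form so that it runs on its own, and the sample text is made up for illustration:

```python
import re

def sentences(doc):
    # Split on a period followed by one or more whitespace characters.
    parts = re.split(r'\.\s+', doc)
    if parts[-1] == '':
        parts = parts[:-1]            # text ended with ". " or ".\n"
    elif parts[-1].endswith('.'):
        parts[-1] = parts[-1][:-1]    # strip the final period
    return parts

sample = 'First sentence.\n\n   Second sentence.  Third sentence.'
print(sentences(sample))
# ['First sentence', 'Second sentence', 'Third sentence']

print(sentences('10.1. This is a sentence. Another sentence.'))
# ['10.1', 'This is a sentence', 'Another sentence']
```

The second call shows a limitation worth keeping in mind: when the clause number itself ends in a period followed by a space, as in the second input from the question, this approach also splits after the number.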

python

regex

python-re