Regex to pick up text between markers and include the last marker

I am looking for a quick way to extract the text that follows each numeric marker, as shown below:

Text to extract from

4.71. Firms should determine the frequency and intensity of monitoring on a risk-sensitive basis, 
taking into account the nature, size and complexity of their business and the level of risk to which they are exposed.   

4.72.  text 
4.9. text
4.9 addf
4.73.  text
4.74.  text 

Attempted solution

(?<=\d\.\d\d\.) [\w+\W*]*?(?=\r?\d\.\d\.*)

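I also need to include the last piece of text, the one after 4.74., which this pattern currently does not match.

Update: extracting text after alphabetical enumerations

Question: how can the logic be modified to capture the text after enumerations like these?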

a) text 
text

a. text text

ii. text

iv. text
iii. text text 

We can use re.findall here, as follows:

import re

inp = """4.71. Firms should determine the frequency and intensity of monitoring on a risk-sensitive basis, taking into account the nature, size and complexity of their business and the level of risk to which they are exposed.

4.72. text 4.9. text 4.9 addf 4.73. text 4.74. text"""

matches = re.findall(r'\b\d+(?:\.\d+)?\.? (.*?)\s*(?=\b\d+(?:\.\d+)?|$)', inp, flags=re.DOTALL)
print(matches)

This prints:

['Firms should determine the frequency and intensity of monitoring on a risk-sensitive basis, taking into account the nature, size and complexity of their business and the level of risk to which they are exposed.',
 'text', 'text', 'addf', 'text', 'text']

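The idea is to find each numeric header and then capture only what follows it. We capture everything up to the next header or, for the last entry, up to the end of the input.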
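Since the title also asks to keep the markers, here is a minimal sketch of the same idea that returns each header together with its text. It assumes the same pattern with one extra capture group around the number; the shortened sample input and the names `pattern` and `pairs` are illustrative only.

```python
import re

# Sample input adapted from the question (first paragraph shortened).
inp = """4.71. Firms should determine the frequency of monitoring.

4.72. text 4.9. text 4.9 addf 4.73. text 4.74. text"""

# Group 1: the numeric header; group 2: the text that follows it.
# The lookahead stops at the next header or at the end of the string,
# which is what allows the final entry (after 4.74.) to be captured.
pattern = re.compile(
    r'\b(\d+(?:\.\d+)?)\.? (.*?)\s*(?=\b\d+(?:\.\d+)?|$)',
    flags=re.DOTALL,
)

pairs = [(m.group(1), m.group(2)) for m in pattern.finditer(inp)]
for marker, text in pairs:
    print(marker, '->', text)
```

Note that any number appearing inside the body text would also be treated as a header by this pattern, so it is only suitable when the text itself contains no digits.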

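If the numbers are at the start of a line, you can match the number and capture everything after it, including all following lines that do not start with digits and a dot.

Note that it does not match 4.9 addf, because there is no trailing . after the number.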

^\d+(?:\.\d+)+\. (.*(?:\r?\n(?!\d+\.).*)*)
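  • ^ Start of string (start of each line when the multiline flag is set)
  • \d+(?:\.\d+)+ Match 1+ digits, then repeat 1+ times a . followed by 1+ digits
  • \. Match the trailing . (the pattern then requires a literal space)
  • ( Capture group 1
    • .* Match the rest of the line
    • (?: Non-capturing group
      • \r?\n(?!\d+\.).* Match a newline, assert that the next line does not start with digits followed by a ., then match the rest of that line
    • )* Close the group and optionally repeat it
  • ) Close group 1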
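The pattern above can be tried in Python roughly as follows. This is a sketch under two assumptions not stated in the answer: `re.MULTILINE` is set so that ^ matches at the start of every line, and the captured text is passed through `strip()` to drop the trailing newline left by blank lines. The sample input is a shortened version of the one in the question.

```python
import re

inp = """4.71. Firms should determine the frequency of monitoring,
taking into account the level of risk.

4.72. text
4.9. text
4.9 addf
4.73. text
4.74. text"""

# Header at the start of a line, then the rest of that line plus any
# following lines that do not begin with digits and a dot.
pattern = r'^\d+(?:\.\d+)+\. (.*(?:\r?\n(?!\d+\.).*)*)'

matches = [m.strip() for m in re.findall(pattern, inp, flags=re.MULTILINE)]
print(matches)
```

The line "4.9 addf" is skipped entirely: it lacks the trailing dot, so it is not a header, and it starts with "4.", so it is not treated as a continuation of the previous entry either.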

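For the alphabetical and roman-numeral enumerations in the update, one possible adaptation of the line-based approach is sketched below. It is not taken from the answers above: the `MARKER` sub-pattern (one to four lowercase letters followed by . or )) is an assumption about what counts as a marker, and `re.MULTILINE` is assumed again.

```python
import re

inp = """a) text
text

a. text text

ii. text

iv. text
iii. text text"""

# Hypothetical marker: 1-4 lowercase letters followed by "." or ")".
# This covers a), a., ii., iv., iii. but is only a heuristic.
MARKER = r'[a-z]{1,4}[.)]'

# Marker at the start of a line, then the rest of that line plus any
# following lines that do not themselves begin with a marker and a space.
pattern = rf'^{MARKER}[ \t]+(.*(?:\r?\n(?!{MARKER}[ \t]).*)*)'

matches = [m.strip() for m in re.findall(pattern, inp, flags=re.MULTILINE)]
print(matches)
```

A caveat of this heuristic: a continuation line that begins with a short word followed by a period, such as "etc. more", would be mistaken for a new marker. If the set of markers is known in advance, listing them explicitly in `MARKER` would be safer.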