How to make my regex match stop after a lookahead?
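I have some text from a PDF in a single string, and I want to break it up into a list in which each string starts with a number and a period and stops just before the next numbered item.
For example, I want to turn this: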
'3.1 First liens 15,209,670,396 0 15,209,670,396 14,216,703,858
3.2 Other than first liens 0 0
4. Real estate:
4.1 Properties occupied by the company (less $ 43,332,898
encumbrances) 68,122,291 0 68,122,291 64,237,046
4.2 Properties held for the production of income (less
$ encumbrances) 0 0
4.3 Properties held for sale (less $
encumbrances) 0 0
5. Cash ($ (101,130,138)), cash equivalents
($ 850,185,973 ) and short-term
investments ($ 0 ) 749,055,835 0 749,055,835 1,867,997,055
6. Contract loans (including $ premium notes) 253,533,676 0 253,533,676 233,680,271
7. Derivatives 3,194,189,871 0 3,194,189,871 2,390,781,023
8. Other invested assets 749,074,191 11,899,360 737,174,831 692,916,503'
into this:
['3.1 First liens 15,209,670,396 0 15,209,670,396 14,216,703,858 ',
'3.2 Other than first liens 0 0 ',
'4. Real estate:',
'4.1 Properties occupied by the company (less $ 43,332,898 encumbrances) 68,122,291 0 68,122,291 64,237,046',
'4.2 Properties held for the production of income (less $ encumbrances) 0 0',
'4.3 Properties held for sale (less $ encumbrances) 0 0',
'5. Cash ($ (101,130,138)), cash equivalents ($ 850,185,973 ) and short-term investments ($ 0 ) 749,055,835 0 749,055,835 1,867,997,055',
'6. Contract loans (including $ premium notes) 253,533,676 0 253,533,676 233,680,271',
'7. Derivatives 3,194,189,871 0 3,194,189,871 2,390,781,023',
'8. Other invested assets 749,074,191 11,899,360 737,174,831 692,916,503']
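The problem is that the original string has '\n' characters scattered through the middle of the headings (for example, in 4.1 there is a \n before the word encumbrances).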
(\d+\.[\s\S]*(?!\d+\.))
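This is the regex I have been trying, but it matches the entire string instead of each numbered line. Is there any way to make my regex stop matching before the next numbered line?
Something like: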
items = re.findall(r"^\d+\..*?(?=^\d+\.|\Z)", text, re.MULTILINE | re.DOTALL)
Further explanation available on request.
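As a rough illustration of why the pattern in the question matches everything while the lazy version stops at each item, here is a minimal sketch. The shortened `text` sample and the whitespace-joining step are my own assumptions, added only to get output in the shape the question asks for.

```python
import re

# Shortened sample in the same shape as the question's text (an assumption for brevity).
text = """3.2 Other than first liens 0 0
4. Real estate:
4.1 Properties occupied by the company (less $ 43,332,898
encumbrances) 68,122,291 0 68,122,291 64,237,046
4.2 Properties held for the production of income (less
$ encumbrances) 0 0"""

# Pattern from the question: the greedy [\s\S]* runs to the end of the string,
# where the negative lookahead succeeds trivially, so there is one big match.
greedy = re.findall(r"(\d+\.[\s\S]*(?!\d+\.))", text)

# Lazy body plus a positive lookahead: stop right before the next line that
# starts with "digits + period", or at the very end of the string.
pattern = re.compile(r"^\d+\..*?(?=^\d+\.|\Z)", re.MULTILINE | re.DOTALL)

# Collapse the embedded newlines (and any repeated spaces) into single spaces.
items = [" ".join(chunk.split()) for chunk in pattern.findall(text)]

for item in items:
    print(item)
```

Note that a continuation line which itself begins with digits followed by a period would be treated as the start of a new item, so this relies on the wrapped lines not looking like item numbers.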
import re
txt = '''3.1 First liens 15,209,670,396 0 15,209,670,396 14,216,703,858
3.2 Other than first liens 0 0
4. Real estate:
4.1 Properties occupied by the company (less $ 43,332,898
encumbrances) 68,122,291 0 68,122,291 64,237,046
4.2 Properties held for the production of income (less
$ encumbrances) 0 0
4.3 Properties held for sale (less $
encumbrances) 0 0
5. Cash ($ (101,130,138)), cash equivalents
($ 850,185,973 ) and short-term
investments ($ 0 ) 749,055,835 0 749,055,835 1,867,997,055
6. Contract loans (including $ premium notes) 253,533,676 0 253,533,676 233,680,271
7. Derivatives 3,194,189,871 0 3,194,189,871 2,390,781,023
8. Other invested assets 749,074,191 11,899,360 737,174,831 692,916,503'''
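# Split on the item numbers, then re-attach each number to the text that follows it.
x = re.split(r'[0-9]+\.[0-9]*', txt)
y = re.findall(r'[0-9]+\.[0-9]*', txt)
z = []
for i in range(len(y)):
    # x[0] is the (empty) text before the first number, so y[i] pairs with x[i+1]
    t = y[i] + x[i+1]
    z.append(t)
print(z)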
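A possible variation on the split idea, sketched under the assumption of Python 3.7 or later (where `re.split` can split on zero-width matches): splitting on a lookahead keeps each item number attached to its text, so no second `findall` pass is needed. The short `sample` string and the whitespace cleanup are illustrative additions, not part of the answer above.

```python
import re

# Small illustrative sample (hypothetical, shaped like the question's data).
sample = "4. Real estate:\n4.1 Properties held (less $\nencumbrances) 0 0\n5. Cash 10"

# Split at every line start that is followed by "digits + period".
# The match is zero-width, so nothing is removed from the text.
chunks = re.split(r"^(?=\d+\.)", sample, flags=re.MULTILINE)

# Drop the empty leading chunk and collapse embedded newlines into spaces.
parts = [" ".join(chunk.split()) for chunk in chunks if chunk.strip()]

print(parts)
```

Anchoring at the start of a line also avoids splitting on a "digits + period" sequence that happens to appear in the middle of a line.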
Replace a newline with a space only where it is needed.
Loop over each capture group found:
^[\']?(?=[\d].)[\d].[\d]*([\s\w\,\:\(\)$\-]*)[\']?[ ]*(\n|\Z)
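One way to read "replace a newline with a space only where needed" is sketched below. It does not use the pattern above; it swaps in a simpler substitution, which is my own assumption about the intent: a newline is turned into a space unless the next line starts a new numbered item.

```python
import re

# Small illustrative sample (hypothetical, shaped like the question's data).
sample = (
    "4.1 A (less $ 43\n"
    "encumbrances) 1 2\n"
    "4.2 B (less\n"
    "$ encumbrances) 0 0\n"
    "5. Cash"
)

# Replace a newline with a space only when the following line does NOT
# begin with "digits + period", i.e. when it is a wrapped continuation line.
joined = re.sub(r"\n(?!\d+\.)", " ", sample)

# The newlines that remain now separate complete items.
lines = joined.splitlines()

print(lines)
```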