Regex: return match plus string up until next match
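Goal: split text into a list based on integer or decimal matches, retrieving all text up to, but not including, the next match. Language/version: Python 3.8.5, using Python's re.findall(). I'm open to other suggestions.
Sample text (yes, it is all on one line):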
1 Something Interesting here 2 More interesting text 2.1 An example of 2C19 a header 2.3 Another header example 2.4 another interesting header 10.1 header stuff 14 the last interesting 3A4 header
Desired output:
['1 Something Interesting here',
'2 More interesting text',
'2.1 An example of 2C19 a header',
'2.3 Another header example',
'2.4 another interesting header',
'10.1 header stuff',
'14 the last interesting 3A4 header'
]
I can identify most of the appropriate integer/decimal starting points:
(\d+\.\d+)|([^a-zA-Z]\d\d)|( \d )
I'm struggling to find a way to return the text between matches plus the match itself.
To save you time, here's my Regex sandbox
Thank you
You can use a positive lookahead to match up until the next match.
Here is the updated regex (sandbox):
\b(?:\d+(?:\.\d+)?)\b.*?(?=\b(?:\d+(?:\.\d+)?)\b|$)
In Python:
import re

regex = r'\b(?:\d+(?:\.\d+)?)\b.*?(?=\b(?:\d+(?:\.\d+)?)\b|$)'
string = ' 1 Something Interesting here 2 More interesting text 2.1 An example of 2C19 a header 2.3 Another header example 2.4 another interesting header 10.1 header stuff 14 the last interesting 3A4 header'
result = re.findall(regex, string)
In this case, result will be:
>>> result
['1 Something Interesting here ',
'2 More interesting text ',
'2.1 An example of 2C19 a header ',
'2.3 Another header example ',
'2.4 another interesting header ',
'10.1 header stuff ',
'14 the last interesting 3A4 header']
Note that this solution also captures the trailing whitespace of each match. If you don't want that whitespace, you can call strip on each string:
>>> [ match.strip() for match in result ]
['1 Something Interesting here',
'2 More interesting text',
'2.1 An example of 2C19 a header',
'2.3 Another header example',
'2.4 another interesting header',
'10.1 header stuff',
'14 the last interesting 3A4 header']
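Since you said you're open to other suggestions: a possible alternative is to split on the whitespace that sits right before a standalone number, rather than matching each chunk. This is only a sketch, and it assumes every section starts with a number preceded by whitespace. The names NUMBER, split_before_number, text and parts are mine, not from the question.

```python
import re

text = (
    " 1 Something Interesting here 2 More interesting text "
    "2.1 An example of 2C19 a header 2.3 Another header example "
    "2.4 another interesting header 10.1 header stuff "
    "14 the last interesting 3A4 header"
)

# An integer or decimal such as 2, 2.1 or 10.1.
NUMBER = r"\d+(?:\.\d+)?"

# Split on whitespace that is immediately followed by a standalone number.
# The trailing \b rejects tokens like 2C19 and 3A4, where the digits run
# straight into a letter. The lookahead consumes nothing, so the number
# stays at the start of the next piece.
split_before_number = re.compile(rf"\s+(?={NUMBER}\b)")

# strip() first so the leading space does not produce an empty first item.
parts = split_before_number.split(text.strip())

for part in parts:
    print(part)
```

Because the separating whitespace is what gets consumed by the split, the pieces come back without trailing spaces, so no extra strip pass is needed. If the text does not start with a number, the first item will be whatever precedes the first number.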