Regex for longest matching sequence between two strings

Question

I searched Google for my use case but did not find anything useful.

I am not an expert in regular expressions, so I would appreciate any help from the community.

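Problem:

Given a text file, I want to use a regular expression to capture the longest string between two substrings (a prefix and a suffix). Note that both substrings will always be at the beginning of a line in the text. Please see the examples below.

Substrings: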

prefixes = ['Item 1', 'Item 1a', 'Item 1b']
suffixes = ['Item 2', 'Item 2a', 'Item 2b']

Example 1:

Item 1 ....
Item 2 ....
Item 1 ....
....
....
Item 2 ....
Item 1 ....
Item 2
Item 1a ....
....
....
....
....
Item 2b ....

Expected result:

Item 1a ....
....
....
....
....

Why this result?

Because, among all prefix-suffix pairs, the prefix Item 1a and the suffix Item 2b enclose the longest string of text.

Example 2:

Item 1 ....
Item 2 ....
Item 1 ....
....
....
Item 2
.... Item 1 ....
Item 2
Item 1a .... ....
....
....
.... Item 2b
....

Expected result:

Item 1 ....
....
....

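Why this result?

Because this is the longest string between two strings (a prefix-suffix pair) where both the prefix and the suffix start at the beginning of a line. Note that there is another pair (Item 1a-Item 2b), but since Item 2b is not at the beginning of a line, that longer sequence cannot be considered.

My attempt with regex:

I tried the following regex for every prefix-suffix pair from the lists above, but it did not work.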

regexs = [r'^' + re.escape(pre) + '(.*?)' + re.escape(suf) for pre in prefixes for suf in suffixes]
for regex in regexs:
    re.findall(regex, text, re.MULTILINE)

My attempt without regex (Python string functions):

def extract_longest_match(text, prefixes, suffixes):
    longest_match = ''
    for line in text.splitlines():
        if line.startswith(tuple(prefixes)):
            beg_index = text.index(line)
            for suf in suffixes:
                end_index = text.find(suf, beg_index+len(line))
                match = text[beg_index:end_index]
                if len(match) > len(longest_match):
                    longest_match = match
    return longest_match

Am I missing something?

Answer 1

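You need to

Build a regex that matches a string from the leftmost starting delimiter to the leftmost trailing delimiter
Make sure the delimiters are matched only at the start of a line
Make sure . matches line-break characters by using re.DOTALL or an equivalent option (see Python regex, matching pattern over multiple lines)
Make sure the regex matches overlapping substrings (see Python regex find all overlapping matches)
Find all matches in the text (see How can I find all matches to a regular expression in Python?)
Get the longest one (see Python's most efficient way to choose longest string in list?).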
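
The overlapping-matches step in the list above can be seen in isolation with a toy string. This is only an illustrative sketch; the string "ababa" and the variable names are made up for the example and are not part of the original problem.

```python
import re

text = "ababa"

# A plain pattern consumes what it matches, so the second "aba"
# (which starts inside the first one) is never reported.
plain = re.findall(r"aba", text)

# A zero-width lookahead consumes nothing; the capture group inside it
# still records the text, so every start position gets tried.
overlapping = re.findall(r"(?=(aba))", text)

print(plain)
print(overlapping)
```

Wrapping the whole pattern in a lookahead with a capture group is the same trick the demo below relies on, so that a candidate starting inside another candidate is not skipped.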

Python demo:

import re
s="""Item 1 ....
Item 2 ....
Item 1 ....
....
....
Item 2 ....
Item 1 ....
Item 2
Item 1a ....
....
....
....
....
Item 2b ...."""
prefixes = ['Item 1', 'Item 1a', 'Item 1b']
suffixes = ['Item 2', 'Item 2a', 'Item 2b']
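# The lookahead reports every candidate, including overlapping ones;
# group 1 holds the matched text.
rx = r"(?=^((?:{}).*?^(?:{})))".format("|".join(prefixes), "|".join(suffixes))
# Or, a version with word boundaries:
# rx = r"(?=^((?:{})\b.*?^(?:{})\b))".format("|".join(prefixes), "|".join(suffixes))
# re.S lets . match newlines; re.M lets ^ match at the start of every line.
all_matches = re.findall(rx, s, re.S | re.M)
print(max(all_matches, key=len))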

Output:

Item 1a ....
....
....
....
....
Item 2

The regex looks like

(?sm)(?=^((?:Item 1|Item 1a|Item 1b).*?^(?:Item 2|Item 2a|Item 2b)))

With word boundaries:

(?sm)(?=^((?:Item 1|Item 1a|Item 1b)\b.*?^(?:Item 2|Item 2a|Item 2b)\b))

See the regex demo.
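
To make the difference between the two variants concrete, here is a small sketch run on a shortened input (the three-line string `s` is an assumption for illustration, not the question's full example). Without `\b`, the alternation stops at the first alternative that fits, so the match ends at "Item 2"; with `\b`, the engine has to go on to the full "Item 2b".

```python
import re

s = "Item 1a ....\n....\nItem 2b ...."
alts_pre = "Item 1|Item 1a|Item 1b"
alts_suf = "Item 2|Item 2a|Item 2b"

# No word boundary: "Item 2" is accepted even though the line says "Item 2b".
without_b = re.findall(
    r"(?=^((?:{}).*?^(?:{})))".format(alts_pre, alts_suf), s, re.S | re.M
)

# Word boundary: "Item 2" is rejected because "b" follows, so "Item 2b" is used.
with_b = re.findall(
    r"(?=^((?:{})\b.*?^(?:{})\b))".format(alts_pre, alts_suf), s, re.S | re.M
)

print(repr(without_b[0]))
print(repr(with_b[0]))
```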

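Details

(?sm) - the re.S and re.M flags
(?=^((?:Item 1|Item 1a|Item 1b).*?^(?:Item 2|Item 2a|Item 2b))) - a positive lookahead that matches at any location immediately followed by this sequence of patterns:
- ^ - start of a line
- ((?:Item 1|Item 1a|Item 1b).*?^(?:Item 2|Item 2a|Item 2b)) - Group 1 (this is the value that re.findall returns)
- (?:Item 1|Item 1a|Item 1b) - any of the items in the alternation (it may make sense to add a \b word boundary after the ) here)
- .*? - any 0+ characters, as few as possible
- ^ - start of a line
- (?:Item 2|Item 2a|Item 2b) - any of the alternatives in the list (it may also make sense to add a \b word boundary after the ) here).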
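
The demo's output keeps the suffix ("Item 2") at the end, while the question's expected result stops before the suffix line. One possible variation, sketched below, moves the suffix outside the capture group so that only the text before it is returned, and also passes the delimiters through re.escape in case they contain regex metacharacters. The function name `longest_between` is hypothetical, and the returned text keeps its trailing newline; treat this as a sketch under those assumptions rather than the answer's own method.

```python
import re

def longest_between(text, prefixes, suffixes):
    """Return the longest span that starts at a line-initial prefix and
    stops right before the next line-initial suffix ('' if none)."""
    pre = "|".join(re.escape(p) for p in prefixes)
    suf = "|".join(re.escape(x) for x in suffixes)
    # Group 1 covers the prefix and the text after it; the suffix is only
    # checked by the lookahead and stays outside the group.
    rx = r"(?=^((?:{}).*?)^(?:{}))".format(pre, suf)
    matches = re.findall(rx, text, re.S | re.M)
    return max(matches, key=len, default="")

# Example 2 from the question: the last pair does not count because
# "Item 2b" is not at the start of a line.
example_2 = """Item 1 ....
Item 2 ....
Item 1 ....
....
....
Item 2
.... Item 1 ....
Item 2
Item 1a .... ....
....
....
.... Item 2b
...."""

result = longest_between(
    example_2,
    ["Item 1", "Item 1a", "Item 1b"],
    ["Item 2", "Item 2a", "Item 2b"],
)
print(repr(result))
```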
