Regex to find a multi-line string that includes another string between lines

Question

This is my first question here.

I have a log file that contains multiple hits (records) like these:

Region: AR
OnlineID: Atl_Tuc
---Start---
FIFA 18 Legacy Edition
---END---

Region: FR
OnlineID: jubtrrzz
---Start---
FIFA 19
Undertale
Pro Evolution Soccer™ 2018
---END---

Region: US
OnlineID: Cu128yi
---Start---
KINGDOM HEARTS HD 1.5 +2.5 ReMIX
---END---

Region: RO
OnlineID: Se116
---Start---
Real Farm
EA SPORTS™ FIFA 20
LittleBigPlanet™ 3
---END---

Region: US
OnlineID: CAJ5Y
---Start---
Madden NFL 18: G.O.A.T. Super Bowl Edition
---END---

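I want to find all hits that contain fifa (fifa as a string). fifa is only an example; I need to find every hit that contains some given string.

The closest I have come is this regex: (?s)(?=^\r\n)(.*?)(fifa)(.*?)(?=\r\n\r\n)

But when I use it, it selects every hit, including the ones without fifa, until it finds a hit that contains fifa, so it sometimes selects more than one hit at a time.

The second problem is that I can't use .* inside (fifa), because it leads to wrong selections.

What can I do now?

The correct output should look like this: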

Region: AR
OnlineID: Atl_Tuc
---Start---
FIFA 18 Legacy Edition
---END---

Region: FR
OnlineID: jubtrrzz
---Start---
FIFA 19
Undertale
Pro Evolution Soccer™ 2018
---END---

Region: RO
OnlineID: Se116
---Start---
Real Farm
EA SPORTS™ FIFA 20
LittleBigPlanet™ 3
---END---

Answer 1

You can use

(?si)(?:^(?<!.)|\R{2})\K(?:(?!\R{2}).)*?\bfifa\b.*?(?=\R{2}|\z)

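See the regex demo.

Details

(?si) - s makes . match line break chars (same as turning the ". matches newline" option ON) and i enables case-insensitive matching
(?:^(?<!.)|\R{2}) - matches the start of the file or a sequence of two line breaks
\K - omits the line breaks matched so far from the match
(?:(?!\R{2}).)*? - any char, 0 or more occurrences but as few as possible, that does not start a double line break sequence
\bfifa\b - the whole word fifa
.*? - any 0+ chars, as few as possible
(?=\R{2}|\z) - up to a double line break or the end of the file.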
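
The pattern above relies on \K, \R and \z, which work in Notepad++ and PCRE but are not available in Python's re module. As a rough sketch only, the same idea can be tried in Python by normalizing line endings first and swapping in a lookbehind and \Z. The function name find_fifa_paragraphs and the shortened sample log are made up for illustration.

```python
import re

# Adaptation, not the answer's exact regex: Python's re has no \K, \R or \z,
# so this uses a lookbehind, a literal "\n\n" and \Z instead.
PARAGRAPH_WITH_FIFA = re.compile(
    r"(?:\A|(?<=\n\n))"   # start of text, or right after a blank line
    r"(?:(?!\n\n).)*?"    # as few chars as possible, never entering a blank line
    r"\bfifa\b"           # the whole word "fifa"
    r".*?"                # the rest of the paragraph, as few chars as possible
    r"(?=\n\n|\Z)",       # stop before the next blank line or at end of text
    re.DOTALL | re.IGNORECASE,
)


def find_fifa_paragraphs(text):
    """Return every blank-line-separated paragraph that contains the word fifa."""
    text = text.replace("\r\n", "\n")  # stand-in for \R (assumes CRLF or LF input)
    return PARAGRAPH_WITH_FIFA.findall(text)


sample = "\n\n".join([
    "Region: AR\nOnlineID: Atl_Tuc\n---Start---\nFIFA 18 Legacy Edition\n---END---",
    "Region: US\nOnlineID: Cu128yi\n---Start---\nKINGDOM HEARTS HD 1.5 +2.5 ReMIX\n---END---",
    "Region: RO\nOnlineID: Se116\n---Start---\nReal Farm\nEA SPORTS™ FIFA 20\n---END---",
])

hits = find_fifa_paragraphs(sample)
for hit in hits:
    print(hit, end="\n\n")
```

Because the tempered dot (?:(?!\n\n).) refuses to step into a blank line, a paragraph without fifa cannot be pulled into the match for a later paragraph, which is the over-selection described in the question.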

Now, if you want to match a paragraph that has fifa followed by 20 on one of its lines, use

(?si)(?:^(?<!.)|\R{2})\K(?:(?!\R{2}).)*?(?-s:\bfifa\b.*\b20\b).*?(?=\R{2}|\z)

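(?-s:\bfifa\b.*\b20\b) is a modifier group inside which . stops matching line breaks: it matches the whole word fifa, then any 0+ chars other than line break chars, as many as possible, and then 20 as a whole word.

See this regex demo.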
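
The scoped (?-s:...) group also exists in Python's re module (3.6 and later), so the "fifa and 20 on the same line" variant can be sketched there too. As before, this is an adaptation that assumes LF line endings and replaces \K, \R and \z with a lookbehind, "\n\n" and \Z; the name FIFA_THEN_20 and the sample text are illustrative only.

```python
import re

# Adaptation for Python's re: (?-s:...) turns DOTALL off inside the group,
# so "fifa ... 20" has to appear on one single line of the paragraph.
FIFA_THEN_20 = re.compile(
    r"(?:\A|(?<=\n\n))(?:(?!\n\n).)*?"  # paragraph start, lazily up to fifa
    r"(?-s:\bfifa\b.*\b20\b)"           # fifa, then 20 as a whole word, same line
    r".*?(?=\n\n|\Z)",                  # rest of the paragraph
    re.DOTALL | re.IGNORECASE,
)

sample = "\n\n".join([
    "Region: AR\nOnlineID: Atl_Tuc\n---Start---\nFIFA 18 Legacy Edition\n---END---",
    "Region: FR\nOnlineID: jubtrrzz\n---Start---\nFIFA 19\nUndertale\n"
    "Pro Evolution Soccer™ 2018\n---END---",
    "Region: RO\nOnlineID: Se116\n---Start---\nReal Farm\nEA SPORTS™ FIFA 20\n"
    "LittleBigPlanet™ 3\n---END---",
])

matches = FIFA_THEN_20.findall(sample)
for match in matches:
    print(match)
```

With this sample only the RO paragraph should be reported: the FR paragraph has fifa on one line and 2018 on another, and 2018 does not contain 20 as a whole word anyway.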

Answer 2

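This problem is better not solved with a single regex. I would use a simpler approach and cut the log file into chunks, one chunk per paragraph.

Then use a regex to check whether each paragraph is a "hit".

Here is some Python code: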

# read the file contents into a string (the with block closes the file afterwards)
with open('/input/log/file/path/here', 'r') as log_file:
    log_text = log_file.read().strip()

# split the string into separate paragraphs (they are separated by a blank line)
paragraphs = log_text.split('\n\n')

# filter the paragraphs to the ones you want
filtered_paragraphs = filter(is_wanted, paragraphs)

# recombine the filtered paragraphs into a new log string
new_log_text = '\n\n'.join(filtered_paragraphs)

# output new log text into new file
with open('/output/log/file/path/here', 'w') as out_file:
    out_file.write(new_log_text)

Of course, you need to define the is_wanted function (put it above the code that calls it):

import re

def is_wanted(paragraph):
    # discard first three and last line to get paragraph content
    p_content = '\n'.join(paragraph.split('\n')[3:-1])
    # input any regex pattern here, such as 'FIFA'; you can pass it into the
    # function as a variable if you need it to be customizable.
    # re.IGNORECASE lets a lowercase search term such as 'fifa' match too.
    return bool(re.search(r'FIFA', p_content, re.IGNORECASE))
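
The comment in is_wanted mentions passing the pattern in as a variable. One possible way to do that, sketched here on an in-memory string instead of files, is shown next. The function name filter_paragraphs, its flags parameter and the sample text are assumptions made for illustration, and the [3:-1] slice assumes every paragraph has the Region, OnlineID, ---Start--- and ---END--- lines seen in the question.

```python
import re


def filter_paragraphs(log_text, pattern, flags=re.IGNORECASE):
    """Keep only the paragraphs whose content lines match the given regex."""
    wanted = re.compile(pattern, flags)
    # Normalize Windows line endings so that splitting on "\n\n" works
    log_text = log_text.replace("\r\n", "\n").strip()
    kept = []
    for paragraph in log_text.split("\n\n"):
        # Drop Region, OnlineID and ---Start--- at the top and ---END--- at the bottom
        content = "\n".join(paragraph.split("\n")[3:-1])
        if wanted.search(content):
            kept.append(paragraph)
    return "\n\n".join(kept)


sample = "\n\n".join([
    "Region: AR\nOnlineID: Atl_Tuc\n---Start---\nFIFA 18 Legacy Edition\n---END---",
    "Region: FR\nOnlineID: jubtrrzz\n---Start---\nFIFA 19\nUndertale\n---END---",
    "Region: US\nOnlineID: Cu128yi\n---Start---\nKINGDOM HEARTS HD 1.5 +2.5 ReMIX\n---END---",
])

fifa_only = filter_paragraphs(sample, r"fifa")
print(fifa_only)
```

Searching only the content lines means a term that happens to appear in an OnlineID does not produce a hit.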


Tags: regex, notepad++, regex-group, regex-lookarounds