How to extract string between numbers? (And keep first number in the string?)

Question

I am trying to extract data from a changelog using RegEx. Here is an example of how the changelog is structured:

96545
this is some changes in the ticket
some new version: x.x.22
another change
new version: x.y.2.2
120091
this is some changes in the ticket
some new version: z.z.22
another change
another change
another change
new version: z.y.2.2
120092
...
...
...

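Each data point starts with an ID that is 5 to 6 digits long.
The number of changes (lines) per ID in the log also varies.
Each data point ends with new version: ***, where *** is a string that differs for every ID.

I am using RegexStorm Tester to test my regex.

So far I have: ^\w{5,6}(.|\n)*?\d{5,6} but the result includes the next ticket's ID, which I need to avoid.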

Result:

96545
this is some changes in the ticket
some new version: x.x.22
another change
new version: x.y.2.2
120091

Expected result:

96545
this is some changes in the ticket
some new version: x.x.22
another change
new version: x.y.2.2

Answer 1

This should do it:

^\d{5,6}[\r\n]*.*?^new version:[^\r\n]*

Just make sure to enable the MULTILINE and DOTALL flags by passing re.MULTILINE | re.DOTALL.

https://regex101.com/r/YeIUQx/1
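As a quick check of the pattern above, here is a minimal sketch that runs it against the sample from the question. The variable names (log, pattern, entries) are illustrative and not part of the original answer.

```python
import re

# Sample changelog copied from the question (two complete tickets).
log = """96545
this is some changes in the ticket
some new version: x.x.22
another change
new version: x.y.2.2
120091
this is some changes in the ticket
some new version: z.z.22
another change
another change
another change
new version: z.y.2.2
"""

# MULTILINE lets ^ match at the start of every line;
# DOTALL lets . match newlines so .*? can span several lines.
pattern = re.compile(
    r"^\d{5,6}[\r\n]*.*?^new version:[^\r\n]*",
    re.MULTILINE | re.DOTALL,
)

entries = pattern.findall(log)
for entry in entries:
    print(entry)
    print("---")
```

Because "new version:" is anchored with ^, the line "some new version: x.x.22" does not end an entry early.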

Answer 2

Capture each record's ID in group 1 and its content in group 2:

r'(?ms)^(\d{5,6}\r?\n)(.*?)^new version:'

https://regex101.com/r/A3ejjN/1
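The following sketch shows what the two groups contain for the sample data. Note that this pattern stops at "new version:" and does not capture the version string itself, so the second part adds a third group for it. That third group, and the names log, record_re, versioned_re, and records, are assumptions for illustration and not part of the original answer.

```python
import re

# Sample changelog copied from the question (two complete tickets).
log = """96545
this is some changes in the ticket
some new version: x.x.22
another change
new version: x.y.2.2
120091
this is some changes in the ticket
some new version: z.z.22
another change
another change
another change
new version: z.y.2.2
"""

# The pattern from the answer: group 1 is the ID line, group 2 the change lines.
record_re = re.compile(r"(?ms)^(\d{5,6}\r?\n)(.*?)^new version:")
first = record_re.search(log)
print(repr(first.group(1)))
print(repr(first.group(2)))

# Hypothetical variant: keep the newline out of the ID group and
# capture the version text after "new version:" in a third group.
versioned_re = re.compile(r"(?ms)^(\d{5,6})\r?\n(.*?)^new version:[ \t]*([^\r\n]*)")
records = [
    {"id": m.group(1), "changes": m.group(2).splitlines(), "version": m.group(3)}
    for m in versioned_re.finditer(log)
]
for record in records:
    print(record)
```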

Answer 3

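Your regex is close. Its problem is that it "ends" at the start of the next log entry, because it uses \d{5,6} to mark the end of an entry (and matches it in the process). As Wiktor mentioned, using "new version" as the delimiter makes more sense, so that is what I did here.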

found_matches = re.findall(r"(^\d{5,6}[\s\S]*?^new version: .*$)", log_file_content, re.MULTILINE)

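The regex (^\d{5,6}[\s\S]*?^new version: .*$) looks for 5 or 6 digits at the start of a line, then takes any characters (including newlines) up to the first instance of new version: that appears at the start of a line. It then reads to the end of that line to complete the group. Be sure to pass re.MULTILINE: it is what lets ^ and $ match at the start and end of each line, while [\s\S] takes care of matching across newlines.

Test the regex here, and the full Python code here.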
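To see why the re.MULTILINE flag matters for this pattern, here is a small sketch that runs it with and without the flag on the sample from the question. The names log_file_content, without_flag, and with_flag are illustrative.

```python
import re

# Sample changelog copied from the question (two complete tickets).
log_file_content = """96545
this is some changes in the ticket
some new version: x.x.22
another change
new version: x.y.2.2
120091
this is some changes in the ticket
some new version: z.z.22
another change
another change
another change
new version: z.y.2.2
"""

pattern = r"(^\d{5,6}[\s\S]*?^new version: .*$)"

# Without MULTILINE, ^ only matches at the very start of the string,
# so the second ^ (before "new version:") can never match.
without_flag = re.findall(pattern, log_file_content)

# With MULTILINE, ^ and $ match at every line boundary.
with_flag = re.findall(pattern, log_file_content, re.MULTILINE)

print(len(without_flag), len(with_flag))
for entry in with_flag:
    print(entry)
    print("---")
```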

Answer 4

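If the problem is that you are capturing the next ticket's ID, just use a positive lookahead, which checks for the ID without capturing or consuming it: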

import re

# `text` is assumed to hold the changelog contents.
# A ticket ends at the end of the line just before the next ticket's ID.
# Note: the last ticket has no following ID, so this lookahead never matches it.
pattern = r"\d{5,6}[\s\S]*?(?=\n\d{5,6})"

# To extract only the first ticket, use search.
print(re.search(pattern, text).group(0))

# To extract all tickets as a list, use findall.
print(re.findall(pattern, text))

# If the file is too big and you want to extract tickets lazily, use finditer.
for ticket in re.finditer(pattern, text):
    print(ticket.group(0))
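One caveat with the lookahead approach: the final ticket is not followed by another ID, so the lookahead cannot succeed for it. The sketch below compares the pattern as written with a variant that also accepts the end of the input (\Z). The variant and the names text, original, with_end, first_only, and all_tickets are assumptions for illustration, not part of the original answer.

```python
import re

# Sample changelog copied from the question (two complete tickets).
text = """96545
this is some changes in the ticket
some new version: x.x.22
another change
new version: x.y.2.2
120091
this is some changes in the ticket
some new version: z.z.22
another change
another change
another change
new version: z.y.2.2"""

# Pattern from the answer: a ticket ends just before "\n" + the next ID.
original = r"\d{5,6}[\s\S]*?(?=\n\d{5,6})"

# Hypothetical variant: also allow a ticket to end at the end of the input.
with_end = r"\d{5,6}[\s\S]*?(?=\n\d{5,6}|\Z)"

first_only = re.findall(original, text)
all_tickets = re.findall(with_end, text)

print(len(first_only), len(all_tickets))
for ticket in all_tickets:
    print(ticket)
    print("---")
```

Since \d{5,6} is not anchored here, a change line that happens to contain a run of five or more digits could start a spurious match; prefixing the pattern with ^ and passing re.MULTILINE is one way to guard against that.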

Tags: python, regex, text, changelog, text-mining
