Match all text between \w: placeholders

Question

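I need to match the text between arbitrary \w: patterns that are not known in advance (so n: text, foo: text, and n: text foo: more text; the test script below has more examples).

To do this I am using Python's finditer with a regular expression, but I cannot capture more than one word between the placeholders. How can I adjust the regex or the finditer approach to do what I want?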

import re

def test_query_parse_regex(query, expected_result):
    result = {}

    # perform the matching here, this needs to change
    r = r"([\w-]+):\s?([\w-]*)"
    matches = re.finditer(r, query)

    for match in matches:
        # eg 'n'
        operator = match.group(1).strip()
        # eg 'text'
        operator_value = match.group(2).strip()

        # build a dict for comparison (inside the loop, so every match is stored)
        result[operator] = operator_value

    if result == expected_result:
        print(f"PASS: {query}")
    else:
        print(f"FAIL: {query}")
        print(f"  Expected: {expected_result}")
        print(f"  Got     : {result}")


checks = [
    # Query, expected
    ("n: tom", {"n": "tom"}),
    ("n: tom preston", {"n": "tom preston"}),
    ("n: tom l: london", {"n": "tom", "l": "london"}),
    ("n: tom preston l: london derry", {"n": "tom preston", "l": "london derry"}),
]

for check in checks:
    test_query_parse_regex(*check)

Note: I have also tried a positive lookahead, but could not get that to work either: r"([\w-]+):\s?([\w-]*)(?=\w:)"

Answer 1

You can use either of these patterns:

r = r"([\w-]+):\s*(.*?)(?=[\w-]+:|$)"
r = r"([\w-]+):\s*(.*?)(?=[\w-]+:|\Z)"

Note that if your string can contain line breaks, you also need to change the re.finditer call to

re.finditer(r, query, re.DOTALL)

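See the regex demo. If you use the re.M or re.MULTILINE option, prefer the version with \Z, because \Z only ever matches at the very end of the string, whereas $ then also matches at the end of every line.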
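To make the effect of the flags concrete, here is a small sketch. The sample query and the helper name pairs are made up for illustration; only the two patterns come from the answer.

```python
import re

# Hypothetical sample query with a line break inside the first value.
query = "n: tom\npreston l: london"

with_dollar = r"([\w-]+):\s*(.*?)(?=[\w-]+:|$)"
with_z = r"([\w-]+):\s*(.*?)(?=[\w-]+:|\Z)"

def pairs(pattern, flags=0):
    """Return (operator, value) tuples found in the sample query."""
    return [
        (m.group(1), m.group(2).strip())
        for m in re.finditer(pattern, query, flags)
    ]

# Without re.DOTALL, .*? cannot cross the newline, so the n: entry is lost.
no_dotall = pairs(with_z)

# With re.DOTALL the value may span lines.
dotall = pairs(with_z, re.DOTALL)

# With re.MULTILINE, $ also matches before the newline and cuts the value short.
multiline_dollar = pairs(with_dollar, re.DOTALL | re.MULTILINE)

# \Z is unaffected by re.MULTILINE.
multiline_z = pairs(with_z, re.DOTALL | re.MULTILINE)

for name, value in [
    ("no_dotall", no_dotall),
    ("dotall", dotall),
    ("multiline_dollar", multiline_dollar),
    ("multiline_z", multiline_z),
]:
    print(name, value)
```

Whether a value is allowed to span several lines depends on your input, so treat the choice of flags as a decision about your data rather than a fixed rule.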

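Details:

([\w-]+) - Group 1: one or more word characters or hyphens
:\s* - a colon followed by zero or more whitespace characters
(.*?) - Group 2: zero or more characters other than line break characters (unless re.DOTALL is used), as few as possible
(?=[\w-]+:|\Z) - a positive lookahead that requires one or more word or hyphen characters followed by a colon, or the end of the string, immediately to the right of the current location.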
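Putting the pieces together, the pattern could be dropped into the question's test roughly like this. This is a minimal sketch: parse_query is a hypothetical helper name, and the checks are the ones from the question.

```python
import re

# Pattern from the answer: operator, colon, optional whitespace, a lazy value,
# then a lookahead for the next "operator:" or the end of the string.
PATTERN = re.compile(r"([\w-]+):\s*(.*?)(?=[\w-]+:|\Z)")

def parse_query(query):
    """Return a dict mapping each operator to its stripped value."""
    return {
        m.group(1): m.group(2).strip()
        for m in PATTERN.finditer(query)
    }

checks = [
    # Query, expected
    ("n: tom", {"n": "tom"}),
    ("n: tom preston", {"n": "tom preston"}),
    ("n: tom l: london", {"n": "tom", "l": "london"}),
    ("n: tom preston l: london derry", {"n": "tom preston", "l": "london derry"}),
]

for query, expected in checks:
    result = parse_query(query)
    status = "PASS" if result == expected else "FAIL"
    print(f"{status}: {query}")
```

The lazy group keeps the whitespace that precedes the next operator, which is why the value is passed through strip() before it is stored.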


python

regex