Python regex ignoring pattern

Question

I have a list of two keywords, as follows:

keywords = ["Azure", "Azure cloud"]

But Python does not find the second keyword, "Azure cloud":

>>> import re
>>> keywords = ["Azure", "Azure cloud"]
>>> r = re.compile('|'.join([re.escape(w) for w in keywords]), flags=re.I)
>>> word = "Azure and Azure cloud"
>>> r.findall(word)
['Azure', 'Azure']

I expect this output: ['Azure', 'Azure', 'Azure cloud']

Any guidance or help would be greatly appreciated!

Answer 1

You can run multiple searches, one per keyword.

import itertools
import re

keywords = ["Azure", "Azure cloud"]
patterns = [re.compile(re.escape(w), flags=re.I) for w in keywords]
word = "Azure and Azure cloud"
results = list(itertools.chain.from_iterable(
    r.findall(word) for r in patterns
))
print(results)

Output:

['Azure', 'Azure', 'Azure cloud']
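
To see why separate searches recover the third match, it can help to print where each match sits in the string. The following is only a minimal sketch: the `spans` name is an illustrative choice, not part of the original answer.

```python
import re

keywords = ["Azure", "Azure cloud"]
word = "Azure and Azure cloud"

# Search once per keyword and record each matched text with its (start, end) span.
spans = [
    (m.group(), m.span())
    for w in keywords
    for m in re.finditer(re.escape(w), word, flags=re.I)
]
print(spans)
# [('Azure', (0, 5)), ('Azure', (10, 15)), ('Azure cloud', (10, 21))]
```

The last two spans both start at index 10, so they overlap. A single findall call never returns overlapping matches, which is why each keyword needs its own search.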

Addendum

If I had word = "Azure and azure cloud", I would get the output ['Azure', 'azure', 'azure cloud']. The second match, "azure", is lowercase. If I only want exact matches of the entries in the "keywords" list, i.e. "Azure", what change has to be made in the code?

The re.I flag means "ignore case", so just remove it.

patterns = [re.compile(re.escape(w)) for w in keywords]
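
As a quick check of what dropping the flag does, here is a small sketch that reuses the keywords from the first snippet and the mixed-case input from the comment above; nothing else is assumed.

```python
import itertools
import re

keywords = ["Azure", "Azure cloud"]
# No re.I flag: matching is now case-sensitive.
patterns = [re.compile(re.escape(w)) for w in keywords]
word = "Azure and azure cloud"

results = list(itertools.chain.from_iterable(
    r.findall(word) for r in patterns
))
print(results)
# ['Azure']
```

Only the capitalized "Azure" at the start survives; the lowercase "azure" and "azure cloud" are no longer matched at all.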

Addendum 2

Sorry, my last comment was vague. I want the pattern matching to ignore case, but in the output I want each keyword in the exact case it has in the "keywords" list, not the case it has in "word".

Sorry for the misunderstanding. Try this:

import re

keywords = ["Azure", "azure cloud"]
patterns = [re.compile(w, flags=re.I) for w in keywords]
word = "Azure and azure cloud"
results = [
    match_obj.re.pattern
    for r in patterns
    for match_obj in r.finditer(word)
]
print(results)

Output:

['Azure', 'Azure', 'azure cloud']

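I'm not sure whether this is the proper way to do it, but it does work.
Note that I removed re.escape because it escapes the space, so the result would otherwise be the following. Dropping re.escape is only safe when the keywords contain no regex metacharacters such as ".", "+" or "(".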

['Azure', 'Azure', 'azure\ cloud']
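
An alternative that keeps re.escape is to store each keyword next to its compiled pattern and emit the keyword itself for every match, instead of reading the pattern back from the match object. This is a sketch of that variation, not the code from the answer above.

```python
import re

keywords = ["Azure", "azure cloud"]
# Keep each original keyword alongside its escaped, case-insensitive pattern.
patterns = [(w, re.compile(re.escape(w), flags=re.I)) for w in keywords]
word = "Azure and azure cloud"

# For every match, emit the keyword as written in the list, not the matched text.
results = [w for w, r in patterns for _ in r.finditer(word)]
print(results)
# ['Azure', 'Azure', 'azure cloud']
```

Because the keyword is taken from the list rather than from the pattern string, the escaping never shows up in the output, and keywords containing regex metacharacters are still matched literally.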

Answer 2

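findall finds all non-overlapping matches. With an alternation, it tries the alternatives from left to right.

So what happens here is that the regex engine reaches "Azure cloud", manages to match "Azure", and then resumes searching at " cloud", because the "Azure" part has already been consumed by a successful match.

If you want "Azure and Azure cloud" to yield "Azure", "Azure" and "Azure cloud", you need to run each pattern separately rather than as one single alternation.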
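
The left-to-right behaviour can be observed by swapping the order of the alternatives. The sketch below uses two hypothetical pattern names, `short_first` and `long_first`, purely for illustration.

```python
import re

word = "Azure and Azure cloud"

# Same two alternatives, in opposite order.
short_first = re.compile("Azure|Azure cloud", flags=re.I)
long_first = re.compile("Azure cloud|Azure", flags=re.I)

print(short_first.findall(word))  # ['Azure', 'Azure']
print(long_first.findall(word))   # ['Azure', 'Azure cloud']
```

Putting the longer alternative first lets it win at the second position, but the result still contains only two items, because matches cannot overlap. Getting all three requires one search per keyword.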
