Python regex ignoring pattern
I have a list of two keywords, as follows:
keywords = ["Azure", "Azure cloud"]
But Python fails to find the second keyword, "Azure cloud":
>>> import re
>>> keywords = ["Azure", "Azure cloud"]
>>> r = re.compile('|'.join([re.escape(w) for w in keywords]), flags=re.I)
>>> word = "Azure and Azure cloud"
>>> r.findall(word)
['Azure', 'Azure']
I expected this output: ['Azure', 'Azure', 'Azure cloud']
Any guidance or help would be appreciated!
You can run multiple searches, one per keyword.
import itertools
import re
keywords = ["Azure", "Azure cloud"]
patterns = [re.compile(re.escape(w), flags=re.I) for w in keywords]
word = "Azure and Azure cloud"
results = list(itertools.chain.from_iterable(
    r.findall(word) for r in patterns
))
print(results)
Output:
['Azure', 'Azure', 'Azure cloud']
Addendum
If I had word = "Azure and azure cloud", the output would be ['Azure', 'azure', 'azure cloud']. The second match, "azure", is lowercase. If I want only exact matches against the "keywords" list, i.e. "Azure", what needs to change in the code?
The re.I flag means ignore case, so just remove it.
patterns = [re.compile(re.escape(w)) for w in keywords]
Addendum 2
Sorry, my last comment was vague. I want the pattern matching to ignore case, but in the output I want each keyword in the exact case it has in the "keywords" list, not the case found in "word".
Sorry for the misunderstanding. Try this:
import re
keywords = ["Azure", "azure cloud"]
patterns = [re.compile(w, flags=re.I) for w in keywords]
word = "Azure and azure cloud"
results = [
    match_obj.re.pattern
    for r in patterns
    for match_obj in r.finditer(word)
]
print(results)
Output:
['Azure', 'Azure', 'azure cloud']
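I'm not sure this is the proper way to do it, but it does work.
Note that I removed re.escape because it escapes the space, and since the result is read back from the pattern string, the output would be:
['Azure', 'Azure', 'azure\\ cloud']
Without re.escape, a keyword containing regex metacharacters (such as "." or "+") would be interpreted as a pattern rather than as literal text.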
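One way to keep re.escape and still get the keyword's own spelling is to carry the keyword alongside its compiled pattern instead of reading it back from the pattern string. The following is only a sketch under that assumption; the names compiled and kw are made up for illustration and are not from the original answer.

```python
import re

keywords = ["Azure", "azure cloud"]
word = "Azure and azure cloud"

# Pair each keyword with its escaped, case-insensitive pattern
# (the pairing is an illustrative choice, not part of the original answer).
compiled = [(kw, re.compile(re.escape(kw), flags=re.I)) for kw in keywords]

# Emit the keyword as spelled in the list, once per match found in the text.
results = [kw for kw, pattern in compiled for _ in pattern.finditer(word)]
print(results)  # ['Azure', 'Azure', 'azure cloud']
```

Because the output comes from the keywords list rather than from the pattern, the escaping never shows up in the results, and keywords with special characters are still matched literally.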
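findall finds all non-overlapping matches. With an alternation, it tries the alternatives from left to right.
So what happens here is that when the regex engine reaches "Azure cloud", it manages to match Azure and then resumes searching from " cloud", because it has already successfully matched Azure at that position.
If you want "Azure and Azure cloud" to yield "Azure", "Azure", and "Azure cloud", you need to run each pattern separately rather than as a single alternation.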
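To make the left-to-right behavior concrete, here is a small sketch comparing the two possible orderings of the alternation. The variable names short_first and long_first are illustrative only. Putting the longer keyword first lets "Azure cloud" match, but because matches cannot overlap, a single pattern still returns only two results.

```python
import re

word = "Azure and Azure cloud"

# Shorter alternative first: "Azure" succeeds wherever "Azure cloud" could,
# so the second alternative is never reached.
short_first = re.compile("Azure|Azure cloud", flags=re.I)
print(short_first.findall(word))  # ['Azure', 'Azure']

# Longer alternative first: "Azure cloud" is tried before "Azure",
# but each stretch of text is still consumed by only one match.
long_first = re.compile("Azure cloud|Azure", flags=re.I)
print(long_first.findall(word))  # ['Azure', 'Azure cloud']
```

Neither ordering produces all three expected items, which is why the keywords have to be searched one pattern at a time.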