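How to load multiple regex patterns from a file and match a given string?
Based on the code below (simplified for this post), can someone show me how to load a list of regex patterns (if 'list' is the right type to use) from a text file and match them against a single string?
There are plenty of examples of loading text strings from a file and matching them against one regex pattern, but not the reverse: many regex patterns matched against one text string.
As you can see in the code, if I create the list manually and run re.compile on each entry, I can use the list of patterns to match a string. But where does re.compile go when the patterns are loaded from a file?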
import regex as re

fname = 'regex_strings_short.txt'
string_to_match = 'onload=alert'

# Create a manual list of regexes
manual_regexes = [
    re.compile(r'(?i)\bHP\b(?:[^.,;]{1,20}?)\bnumber\b'),
    re.compile(r'(?i)\bgmail\b(?:[^.,;]{1,20}?)\bnumber\b'),
    re.compile(r'(?i)\bearthlink\b(?:[^.,;]{1,20}?)\bnumber\b'),
    re.compile(r'(?i)onload=alert')
]

# Create a text file with these five example patterns
'''
(?i)\bHP\b(?:[^.,;]{1,20}?)\bnumber\b
(?i)\bgmail\b(?:[^.,;]{1,20}?)\bnumber\b
(?i)\bearthlink\b(?:[^.,;]{1,20}?)\bnumber\b
(?i)onload=alert
(?i)hello
'''

# Import a list of regex patterns from the created file
with open(fname, 'r') as file:
    imported_regexes = file.readlines()

# Notice the difference in the formatting of the manual list with 'regex.Regex' and 'flags=regex.I | regex.V0' wrapping each item
print(manual_regexes)
print('---')
print(imported_regexes)

# A match is found in the manual list, but no match found in the imported list
if re.match(imported_regexes[3], string_to_match):
    print('Match found in imported_regexes.')
else:
    print('No match in imported_regexes.')
print('---')
if re.match(manual_regexes[3], string_to_match):
    print('Match found in manual_regexes.')
else:
    print('No match in manual_regexes.')
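There is no match from imported_regexes, but there is a match from manual_regexes.
Update: The code below is the final solution that worked for me. I'm posting it in case it helps someone who lands here needing a solution.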
# Use the third-party regex module (imported as re), not the standard 're' module,
# because the standard module does not support \p{...} Unicode property classes
import regex as re

# Add the post/string to match below
my_string = '<p>HP Support number</p>'
fname = 'regex_strings.txt'

# Contents of text file similar to the below
# but without the leading # space - that's only because it's an inline comment here
# (?i)\bHP\b(?:[^.,;]{1,20}?)\bnumber\b
# (?i)\bgmail\b(?:[^.,;]{1,20}?)\bnumber\b
# (?i)】\b(?:[^.,;]{1,1000}?)\p{Lo}

# Import a list of regex patterns from a file;
# splitlines() drops the trailing newline from each pattern
with open(fname, 'r', encoding="utf8") as f:
    loaded_patterns = f.read().splitlines()

# print(loaded_patterns)
print(len(loaded_patterns))

# Report every pattern that matches anywhere in the string
found = False
for pattern in loaded_patterns:
    if re.findall(pattern, my_string):
        print('Match found. ' + pattern)
        found = True
if not found:
    print('No matching regex found.')
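re.match accepts either a string or a compiled regex as its pattern argument, and internally converts a string into a compiled regex object. You can call re.compile yourself as an optimization (when the same regex is used many times), but that is entirely optional as far as program correctness is concerned.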
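To illustrate the point about re.compile being optional, here is a minimal sketch. It assumes the standard re module so that it runs without extra packages; the regex module used in the question exposes the same match and compile calls. The variable names are made up for the example.

```python
import re

text = 'onload=alert'
pattern_text = r'(?i)onload=alert'

# Pattern passed as a plain string: it is compiled internally on the call.
from_string = re.match(pattern_text, text)

# Pattern compiled explicitly first, then reused.
compiled = re.compile(pattern_text)
from_compiled = compiled.match(text)

# A compiled pattern can also be handed to the module-level function.
from_module_call = re.match(compiled, text)

print(from_string.group(), from_compiled.group(), from_module_call.group())
```

All three calls find the same match, so compiling only matters when the same pattern is applied many times.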
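If the program does not report a match for the imported regex, that is because readlines() keeps the trailing '\n' on every string it returns. As a result, re.match('(?i)onload=alert\n', string_to_match) returns None instead of a match object, because the pattern now requires a newline that the string does not contain. Strip the newline first; after that you can call re.compile on the cleaned string, or not.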
with open(fname, 'r') as file:
    imported_regexes = file.readlines()

print(re.match(imported_regexes[3].strip('\n'), string_to_match))
This prints a match object.
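Putting the pieces together, one possible way to load, clean, and compile the patterns in a single step is sketched below. The helper names load_patterns and matching_patterns are hypothetical, the sample file is written to a temporary directory only so the example is self-contained, and the standard re module is assumed (swap in regex if the patterns need \p{...}).

```python
import re
import tempfile
from pathlib import Path


def load_patterns(path):
    """Read one regex per line, dropping line endings and blank lines."""
    patterns = []
    with open(path, 'r', encoding='utf8') as f:
        for line_number, line in enumerate(f, start=1):
            text = line.rstrip('\r\n')  # remove the newline readlines() would keep
            if not text:
                continue  # skip empty lines
            try:
                patterns.append(re.compile(text))
            except re.error as exc:
                print(f'Skipping invalid pattern on line {line_number}: {exc}')
    return patterns


def matching_patterns(patterns, string):
    """Return the source text of every pattern found anywhere in string."""
    return [p.pattern for p in patterns if p.search(string)]


# Demo with a throwaway pattern file (contents taken from the question).
sample_lines = [
    r'(?i)\bHP\b(?:[^.,;]{1,20}?)\bnumber\b',
    r'(?i)onload=alert',
    r'(?i)hello',
]
with tempfile.TemporaryDirectory() as tmp:
    fname = Path(tmp) / 'regex_strings_short.txt'
    fname.write_text('\n'.join(sample_lines) + '\n', encoding='utf8')
    patterns = load_patterns(fname)

hits = matching_patterns(patterns, 'onload=alert')
print(hits)
```

Compiling once at load time also surfaces malformed patterns early, which is a design choice of this sketch rather than something the original code does.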