Python regular expression with multiple patterns: extracting the right groups
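I wrote a regular expression in Python that should search for season/s and episode/e followed by a number. As you can see in my code, I support a variety of patterns for finding what I want.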
import re
episode = re.compile(r"""(?:s|season)(?:\s)(\d+)(?:e|x|episode|\n)(?:\s)(\d+)| # s 01e 02
(?:s|season)(\d+)(?:e|x|episode|\n)(?:\s)(\d+)| # s01e 02
(?:s|season)(?:\s)(\d+)(?:e|x|episode|\n)(\d+)| # s 01e02
(?:s|season)(\d+)(?:e|x|episode|\n)(\d+)| # s01e02
(?:s|season)(\d+)(?:.*)(?:e|x|episode|\n)(\d+)| # s01 random123 e02
(?:s|season)(?:\s)(\d+)(?:.*)(?:e|x|episode|\n)(?:\s)(\d+)| # s 01 random123 e 02
(?:s|season)(?:\s)(\d+)(?:.*)(?:e|x|episode|\n)(\d+)| # s 01 random123 e02
(?:s|season)(\d+)(?:.*)(?:e|x|episode|\n)(?:\s)(\d+) # s01 random123 e 02
""", re.VERBOSE)
test="Hello seinfeld season 01episode 22 foo bar"
match = re.search(episode, test)
print(match.group(1), match.group(2))
This code prints 01 22, as expected.
But if the test string looks like this:
test="Hello seinfeld season 01 episode 22 foo bar"
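How can I know which groups to use? In other words, I don't know in advance what value test will have.
Edit: Maybe I could check the value of every group and use whichever ones are truthy. But that seems like the wrong approach.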
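To make the idea in the edit concrete, here is a minimal sketch of "use whichever groups are set". It assumes a reduced two-branch version of the pattern (not the full eight branches), and the helper name season_episode is made up for illustration. Groups that belong to a branch that did not take part in the match come back as None, so filtering them out leaves the two captured numbers.

```python
import re

# Reduced two-branch version of the alternation above (an assumption made for brevity).
episode = re.compile(r"""
    (?:s|season)\s(\d+)(?:e|x|episode)\s(\d+) |      # season 01episode 22
    (?:s|season)\s(\d+).*(?:e|x|episode)\s(\d+)      # season 01 random123 episode 22
""", re.VERBOSE)

def season_episode(text):
    """Return the (season, episode) strings of whichever branch matched, or None."""
    match = episode.search(text)
    if match is None:
        return None
    # Groups from branches that did not participate in the match are None.
    return tuple(g for g in match.groups() if g is not None)

print(season_episode("Hello seinfeld season 01episode 22 foo bar"))
print(season_episode("Hello seinfeld season 01 episode 22 foo bar"))
```

Both calls should print ('01', '22'), even though the second string is matched by the second branch, whose numbers are groups 3 and 4 rather than 1 and 2.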
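How about splitting the regular expression into a list in which each element holds a single pattern? That keeps each variant separate and helps you organize the patterns if you need to add or remove some later. You may also want to use named groups.
I made two changes to the original example: 1) individual patterns, and 2) named groups, for example: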
import re
pattern1 = re.compile(r"""(?:s|season)(?:\s)(?P<s>\d+)(?:e|x|episode|\n)(?:\s)(?P<ep>\d+) # s 01e 02""", re.VERBOSE)
pattern2 = re.compile(r"""(?:s|season)(?P<s>\d+)(?:e|x|episode|\n)(?:\s)(?P<ep>\d+) # s01e 02""", re.VERBOSE)
pattern3 = re.compile(r"""(?:s|season)(?:\s)(?P<s>\d+)(?:e|x|episode|\n)(?P<ep>\d+) # s 01e02""", re.VERBOSE)
pattern4 = re.compile(r"""(?:s|season)(?P<s>\d+)(?:e|x|episode|\n)(?P<ep>\d+) # s01e02""", re.VERBOSE)
pattern5 = re.compile(r"""(?:s|season)(?P<s>\d+)(?:.*)(?:e|x|episode|\n)(?P<ep>\d+) # s01 random123 e02""", re.VERBOSE)
pattern6 = re.compile(r"""(?:s|season)(?:\s)(?P<s>\d+)(?:.*)(?:e|x|episode|\n)(?:\s)(?P<ep>\d+) # s 01 random123 e 02""", re.VERBOSE)
pattern7 = re.compile(r"""(?:s|season)(?:\s)(?P<s>\d+)(?:.*)(?:e|x|episode|\n)(?P<ep>\d+) # s 01 random123 e02""", re.VERBOSE)
pattern8 = re.compile(r"""(?:s|season)(?P<s>\d+)(?:.*)(?:e|x|episode|\n)(?:\s)(?P<ep>\d+) # s01 random123 e 02""", re.VERBOSE)
patterns = [pattern1, pattern2, pattern3, pattern4, pattern5, pattern6, pattern7, pattern8 ]
test="Hello seinfeld season 01episode 22 foo bar"
for idx, p in enumerate(patterns):
    m = re.search(p, test)
    if m:
        print('MATCHED PATTERN: {}'.format( patterns[idx].pattern ) )
        print(' SEASON: {}'.format( m.group('s')) )
        print(' EPISODE: {}'.format( m.group('ep')) )
Output:
MATCHED PATTERN: (?:s|season)(?:\s)(?P<s>\d+)(?:e|x|episode|\n)(?:\s)(?P<ep>\d+) # s 01e 02
SEASON: 01
EPISODE: 22
MATCHED PATTERN: (?:s|season)(?:\s)(?P<s>\d+)(?:.*)(?:e|x|episode|\n)(?:\s)(?P<ep>\d+) # s 01 random123 e 02
SEASON: 01
EPISODE: 22
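Of course, you will need some extra logic to choose which match to use (for example, you could simply take the first complete match), but this at least gives you a clearer picture of which regex pattern hit.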
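The "take the first match" logic could look something like the sketch below. It reuses pattern1 and pattern6 from the list above (without the VERBOSE comments). The function name first_match and the choice to list the stricter pattern first are assumptions of this example, not part of the original answer.

```python
import re

# pattern1 and pattern6 from the list above, stricter pattern first.
# The ordering is an assumption: earlier entries win over later ones.
PATTERNS = [
    re.compile(r"(?:s|season)(?:\s)(?P<s>\d+)(?:e|x|episode|\n)(?:\s)(?P<ep>\d+)"),
    re.compile(r"(?:s|season)(?:\s)(?P<s>\d+)(?:.*)(?:e|x|episode|\n)(?:\s)(?P<ep>\d+)"),
]

def first_match(text, patterns=PATTERNS):
    """Return (pattern index, season, episode) for the first pattern that matches."""
    for idx, pattern in enumerate(patterns):
        m = pattern.search(text)
        if m:
            # Named groups make the lookup independent of group numbering.
            return idx, m.group("s"), m.group("ep")
    return None

print(first_match("Hello seinfeld season 01episode 22 foo bar"))
print(first_match("Hello seinfeld season 01 episode 22 foo bar"))
```

Returning as soon as one pattern matches avoids the duplicate hits seen in the output above, at the cost of making the order of the list significant.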
Try this: \s*(season|s)\s*(\d+)(episode|e|x)\s*(\d+)
Your matches are in groups 2 and 4.
import re
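# The Python 2 ur'' prefix is a syntax error in Python 3; a plain raw string works in both.
p = re.compile(r'\s*(season|s)\s*(\d+)(episode|e|x)\s*(\d+)', re.MULTILINE)
test_str = u"Hello seinfeld season 01episode 22 foo bar\ns 01e 02\n"
# findall returns one tuple per match; groups 2 and 4 are tuple indexes 1 and 3.
print(re.findall(p, test_str))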
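As written, this pattern allows no whitespace between the season number and the episode marker, so it would likely miss the "season 01 episode 22" string from the question. A hedged variant is sketched below. The extra \s*, the named groups s and ep, and the third line of test input are all assumptions added for illustration.

```python
import re

# Same idea as the pattern above with two assumed changes: named groups, and
# an extra \s* before the episode marker so "season 01 episode 22" also matches.
p = re.compile(r"(?:season|s)\s*(?P<s>\d+)\s*(?:episode|e|x)\s*(?P<ep>\d+)")

test_str = (
    "Hello seinfeld season 01episode 22 foo bar\n"
    "s 01e 02\n"
    "season 01 episode 22\n"
)

# finditer yields match objects, so the named groups can be read directly.
results = [(m.group("s"), m.group("ep")) for m in p.finditer(test_str)]
print(results)
```

With named groups the caller no longer needs to remember that the numbers sit in groups 2 and 4. Unlike the multi-pattern approach, this single pattern does not allow arbitrary text between the season and episode parts.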