Regex match strings divided by 'and'
I need to parse a string to get the required counts and positions from it, for example:
2 Better Developers and 3 Testers
5 Mechanics and chef
medic and 3 nurses
Currently I am using code like this, which returns a list of tuples, e.g. [('2', 'Better Developers'), ('3', 'Testers')]:
import re

def parse_workers_list_from_str(string_value: str) -> [(str, str)]:
    result: [(str, str)] = []
    if string_value:
        # Note: split('and') also splits inside words that contain "and"
        for part in string_value.split('and'):
            result.append(re.findall(r'(?: *)(\d+|)(?: |)([\w ]+)', part.strip())[0])
    return result
Can I do this with a regular expression only, without .split()?
With re.MULTILINE you can do everything in a single regex, which also splits everything correctly:
>>> s = """2 Better Developers and 3 Testers
5 Mechanics and chef
medic and 3 nurses"""
>>> re.findall(r"\s*(\d*)\s*(.+?)(?:\s+and\s+|$)", s, re.MULTILINE)
[('2', 'Better Developers'), ('3', 'Testers'), ('5', 'Mechanics'), ('', 'chef'), ('', 'medic'), ('3', 'nurses')]
The same regex with an explanation, plus conversion of the empty '' to 1:
import re
s = """2 Better Developers and 3 Testers
5 Mechanics and chef
medic and 3 nurses"""
results = re.findall(r"""
# Capture the number if one exists
(\d*)
# Remove spacing between number and text
\s*
# Capture the text
(.+?)
# Attempt to match the word 'and' or the end of the line
(?:\s+and\s+|$\n?)
""", s, re.MULTILINE|re.VERBOSE)
results = [(int(n or 1), t.title()) for n, t in results]
results == [(2, 'Better Developers'), (3, 'Testers'), (5, 'Mechanics'), (1, 'Chef'), (1, 'Medic'), (3, 'Nurses')]
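If you want this as a drop-in replacement for the function in the question, something like the following should work. It is only a sketch: the names WORKER_RE and parse_workers are made up for illustration, and it assumes Python 3.9+ for the list[tuple[...]] annotation. Unlike .split('and'), the \s+and\s+ separator does not fire inside words such as "handymen".

```python
import re

# Same pattern as above, compiled once. WORKER_RE is an illustrative name.
WORKER_RE = re.compile(r"""
    (\d*)                 # optional count
    \s*                   # spacing between count and title
    (.+?)                 # title, as short as possible
    (?:\s+and\s+|$\n?)    # the separator 'and', or the end of the line
""", re.MULTILINE | re.VERBOSE)

def parse_workers(text: str) -> list[tuple[int, str]]:
    """Return (count, title) pairs; a missing count is treated as 1."""
    return [(int(n or 1), title) for n, title in WORKER_RE.findall(text)]

print(parse_workers("5 Mechanics and chef"))
print(parse_workers("2 handymen and 3 nurses"))
```

An empty input string simply yields an empty list, because the title group needs at least one character.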
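If you want to handle multiple `and` separators, you should consider using the PyPI regex module, which allows us to use a branch reset group, i.e. (?|...). Subpatterns declared in each alternative of this construct start over from the same group index.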
(?|(\d*) *(\b[a-z]+(?: [a-z]+)*?)(?= and )|(?<= and )(\d*) *(\b[a-z]+(?: [a-z]+)*))
import regex
rx = regex.compile(r'(?|(\d*) *(\b[a-z]+(?: [a-z]+)*?)(?= and )|(?<= and )(\d*) *(\b[a-z]+(?: [a-z]+)*))', regex.I)
arr = ['2 Better Developers and 3 Testers', '5 Mechanics and chef', 'medic and 3 nurses', '5 foo', '5 Mechanics and 2 chefs and tester']
for s in arr: print (rx.findall(s), ':', s)
Output:
[('2', 'Better Developers'), ('3', 'Testers')] : 2 Better Developers and 3 Testers
[('5', 'Mechanics'), ('', 'chef')] : 5 Mechanics and chef
[('', 'medic'), ('3', 'nurses')] : medic and 3 nurses
[] : 5 foo
[('5', 'Mechanics'), ('2', 'chefs'), ('', 'tester')] : 5 Mechanics and 2 chefs and tester
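To see what the branch reset group buys you, here is a rough standard-library equivalent. With a plain (?:...) group the two alternatives get groups 1-2 and 3-4 respectively, so you have to pick the populated half yourself. The helper name find_pairs is hypothetical, and this is a sketch rather than a recommendation over the regex module.

```python
import re

# The same two alternatives inside a plain non-capturing group:
# the first alternative fills groups 1-2, the second fills groups 3-4.
rx = re.compile(
    r'(?:(\d*) *(\b[a-z]+(?: [a-z]+)*?)(?= and )'
    r'|(?<= and )(\d*) *(\b[a-z]+(?: [a-z]+)*))',
    re.I,
)

def find_pairs(s):
    pairs = []
    for m in rx.finditer(s):
        # A group that did not take part in the match is None.
        if m.group(2) is not None:
            pairs.append((m.group(1), m.group(2)))
        else:
            pairs.append((m.group(3), m.group(4)))
    return pairs

print(find_pairs('5 Mechanics and 2 chefs and tester'))
```

With the branch reset group, both alternatives write to groups 1 and 2, so findall returns the pairs directly and this bookkeeping is not needed.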
Earlier answer, based on the original question where there is a single `and`.
You can use this regex:
(\d*) *(\S+(?: \S+)*?) and (\d*) *(\S+(?: \S+)*)
Here we match `and` surrounded by a single space on each side. Before and after `and`, we match using this subpattern:
(\d*) *(\S+(?: \S+)*?)
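It matches 0 or more leading digits (so the number is optional), followed by 0 or more spaces, followed by 1 or more space-separated runs of non-whitespace characters.
Code: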
import re
arr = ['2 Better Developers and 3 Testers', '5 Mechanics and chef', 'medic and 3 nurses', '5 foo']
rx = re.compile(r'(\d*) *(\S+(?: \S+)*?) and (\d*) *(\S+(?: \S+)*)')
for s in arr: print (rx.findall(s))
Output:
[('2', 'Better Developers', '3', 'Testers')]
[('5', 'Mechanics', '', 'chef')]
[('', 'medic', '3', 'nurses')]
[]
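Since this regex returns one 4-tuple per string, you may want to reshape it into the (count, title) pairs asked for in the question. A minimal sketch, assuming a missing count should default to 1; the helper name to_pairs is made up for illustration:

```python
import re

rx = re.compile(r'(\d*) *(\S+(?: \S+)*?) and (\d*) *(\S+(?: \S+)*)')

def to_pairs(s):
    m = rx.search(s)
    if m is None:
        # No ' and ' separator found, e.g. '5 foo'
        return []
    n1, t1, n2, t2 = m.groups()
    # An empty count string means no number was given; treat it as 1.
    return [(int(n1 or 1), t1), (int(n2 or 1), t2)]

print(to_pairs('5 Mechanics and chef'))
print(to_pairs('5 foo'))
```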