Parsing a string with regular expressions
I am struggling to parse some text correctly. The text varies a lot. Ideally I would like to do this in Python, but any language will do.
Sample strings:
"if magic code is 543, and type is 5642, 912342, or 7425, type has to have a period. EX: 02-15-99"
"If Magic Code is 722, 43, or 643256 and types is 43234, 5356, and 2112, type has to start with period."
"if magic code is 4542 it is not valid in type."
"if magic code is 532 and date is within 10 years from current data, and the type is 43, the type must begin with law number."
Desired result:
[543] [5642, 912342, 7425][type has to have a period.]
[722, 43, 643256][3234, 5356, and 2112][type has to start with period.]
[4542][it is not valid in type.]
[532][43][the type must begin with law number.]
There are other variations, but you get the idea. Sorry, I am not very good with regular expressions.
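Well... this does what you asked for. But it is very ugly, and very specific to the examples you provided. I suspect it will fail against a real data file.
When faced with this kind of parsing job, one way to approach it is to run the input data through some preliminary clean-up, simplifying and rationalizing the text as much as possible. For example, handling the different styles of integer lists is annoying and makes the regular expressions more complex. If you can remove the unnecessary commas between integers and drop the terminal "or"/"and", the regular expressions become much simpler. Once that clean-up is done, you can sometimes apply one or more regular expressions to extract the bits you need. In some cases, the outliers that do not fit the main regular expressions can be handled with specific lookups or hard-coded special-case rules.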
import re

lines = [
    "if magic code is 543, and type is 5642, 912342, or 7425, type has to have a period. EX: 02-15-99",
    "If Magic Code is 722, 43, or 643256 and types is 43234, 5356, and 2112, type has to start with period.",
    "if magic code is 4542 it is not valid in type.",
    "if magic code is 532 and date is within 10 years from current data, and the type is 43, the type must begin with law number.",
]

# An integer list is "A or B", "A, B, ... or C", or a single number.
mcs_rgx = re.compile(r'magic code is (\d+ (or|and) \d+|\d+(, \d+)*,? (or|and) \d+|\d+)', re.IGNORECASE)
types_rgx = re.compile(r'types? is (\d+ (or|and) \d+|\d+(, \d+)*,? (or|and) \d+|\d+)', re.IGNORECASE)
# The trailing rule text: try "type has/must ..." first, then fall back
# to everything after the last digit in the line.
rest_rgx1 = re.compile(r'(type (has|must).+)')
rest_rgx2 = re.compile(r'.+\d(.+)')
nums_rgx = re.compile(r'\d+')

for line in lines:
    m = mcs_rgx.search(line)
    if m:
        mcs_text = m.group(1)
        # list(...) is needed on Python 3, where map returns an iterator.
        mcs = list(map(int, nums_rgx.findall(mcs_text)))
    else:
        mcs = []

    m = types_rgx.search(line)
    if m:
        types_text = m.group(1)
        types = list(map(int, nums_rgx.findall(types_text)))
    else:
        types = []

    m = rest_rgx1.search(line)
    if m:
        rest = [m.group(1)]
    else:
        m = rest_rgx2.search(line)
        if m:
            rest = [m.group(1)]
        else:
            rest = ['']

    print(mcs, types, rest)
Output:
[543] [5642, 912342, 7425] ['type has to have a period. EX: 02-15-99']
[722, 43, 643256] [43234, 5356, 2112] ['type has to start with period.']
[4542] [] [' it is not valid in type.']
[532] [43] ['type must begin with law number.']
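The preliminary clean-up described above could look roughly like the following. This is only a sketch against the four sample strings: the helper names (`normalize_int_lists`, `extract_codes`) and the exact separator pattern are my own assumptions, not something taken from a real data file. It collapses "A, B, or C" style lists into "A B C" first, after which the extraction patterns only need to match space-separated digits.

```python
import re

# Hypothetical helpers for illustration; names and patterns are assumptions.
# A separator is a comma and/or "or"/"and" sitting directly between two integers.
_SEP = re.compile(r'(?<=\d)(?:,?\s+(?:or|and)\s+|,\s+)(?=\d)', re.IGNORECASE)
# After clean-up, an integer list is just digits separated by single spaces.
_MCS = re.compile(r'magic code is (\d+(?: \d+)*)', re.IGNORECASE)
_TYPES = re.compile(r'types? is (\d+(?: \d+)*)', re.IGNORECASE)


def normalize_int_lists(text):
    """Rewrite '1, 2, or 3' / '1, 2 and 3' as '1 2 3'; leave other text alone."""
    return _SEP.sub(' ', text)


def extract_codes(line):
    """Return (magic_codes, types) as two lists of ints for one sentence."""
    cleaned = normalize_int_lists(line)

    def ints(rgx):
        m = rgx.search(cleaned)
        return [int(n) for n in m.group(1).split()] if m else []

    return ints(_MCS), ints(_TYPES)


if __name__ == "__main__":
    sample = "If Magic Code is 722, 43, or 643256 and types is 43234, 5356, and 2112, type has to start with period."
    print(normalize_int_lists(sample))
    print(extract_codes(sample))
```

The lookbehind and lookahead keep the digits themselves out of the match, so a comma followed by a word (as in "543, and type is") is left untouched.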
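Here is a solution that uses a single regular expression plus some clean-up afterwards. It works for all of your examples, but as noted in the comments, if your sentences vary much more than this, you should explore options other than regular expressions.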
import re

sentences = [
    "if magic code is 543, and type is 5642, 912342, or 7425, type has to have a period. EX: 02-15-99",
    "If Magic Code is 722, 43, or 643256 and types is 43234, 5356, and 2112, type has to start with period.",
    "if magic code is 4542 it is not valid in type.",
    "if magic code is 532 and date is within 10 years from current data, and the type is 43, the type must begin with law number.",
]

# Group 1: magic codes; group 2: types (when present); group 3 or 4: the rule text.
# Raw strings keep \s and \d from being treated as string escapes.
pat = r'(?i)^if\smagic\scode\sis\s(\d+(?:,?\s(?:\d+|or))*)(?:.*types?\sis\s(\d+(?:,?\s(?:\d+|or|and))*,)(.*\.)|(.*\.))'
# Pull every integer out of a captured chunk of text.
find_ints = lambda s: [int(d) for d in re.findall(r'\d+', s)]
# Keep only the groups that actually took part in the match.
matches = [[g for g in re.match(pat, s).groups() if g] for s in sentences]
# All groups but the last are integer lists; the last is the rule text.
results = [[find_ints(m) for m in match[:-1]] + [[match[-1].strip()]] for match in matches]
If you need it pretty-printed as in your example:
for r in results:
    print(*r, sep='')
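One caveat with the list comprehension above: `re.match` returns `None` for a sentence that does not fit the pattern, so calling `.groups()` on it raises `AttributeError`. Given that the real data reportedly has more variation, a guarded version may be safer. The following is a sketch only; the function name `parse_sentence` is hypothetical, and the pattern is the same one as above, split across two lines.

```python
import re

# Same pattern as in the answer above, compiled once.
PAT = re.compile(
    r'(?i)^if\smagic\scode\sis\s(\d+(?:,?\s(?:\d+|or))*)'
    r'(?:.*types?\sis\s(\d+(?:,?\s(?:\d+|or|and))*,)(.*\.)|(.*\.))'
)


def parse_sentence(sentence):
    """Return [int_list, ..., [rule_text]] or None if the sentence does not match.

    Hypothetical helper for illustration; not part of the original answer.
    """
    m = PAT.match(sentence)
    if m is None:
        return None
    # Drop groups that did not participate in the match.
    groups = [g for g in m.groups() if g]
    int_lists = [[int(d) for d in re.findall(r'\d+', g)] for g in groups[:-1]]
    return int_lists + [[groups[-1].strip()]]


if __name__ == "__main__":
    for s in ["if magic code is 4542 it is not valid in type.",
              "this sentence does not follow the pattern"]:
        parsed = parse_sentence(s)
        if parsed is None:
            print("unparsed:", s)  # candidates for special-case rules
        else:
            print(*parsed, sep='')
```

Collecting the unparsed sentences separately makes it easier to see which variations the pattern misses before deciding whether to extend it or handle them with special-case rules.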