提取嵌套括号内的字符串及其 pre/postfix Python
Extract sting inside nested brackets and its pre/postfix Python
我正在尝试提取嵌套括号内的字符串并生成它们。
假设我有以下字符串
string = "(A((B|C)D|E|F))"
根据
中的回答
我可以提取嵌套括号内的字符串,但对于我的情况来说它是不同的,因为我在括号的末尾有 "D"
所以这是代码的结果。它看起来与我想要的输出相去甚远
['B|C', 'D|E|F', 'A']
这是我想要的输出
[[['A'],['B|C'],['D']], [['A'],['E|F']']] # '|' means OR
你有什么建议吗,我应该使用正则表达式来实现还是只 运行 遍历所有给定的字符串?
所以可以引出我最后的结果,就是
"ABD"
"ACD"
"AE"
"AF"
在这一点上,我将使用itertools.product
您没有准确指定语言,但似乎允许任意嵌套括号。这不是一种常规语言。我不建议用正则表达式解析它(这可能是因为 python 中的正则表达式不是真正的正则,但即使可能,它也可能是一团糟)。
我建议为您的语言定义一个上下文无关语法并解析它。方法如下:
EXPR -> A EXPR (an expression is an expression preceded by an alphabetic character)
EXPR -> (LIST) EXPR (an expression is a list followed by an expression)
EXPR -> "" (an expression can be an empty string)
LIST -> EXPR | LIST (a list is an expression followed by "|" followed by a list)
LIST -> EXPR (or just one expression)
此文法可以由线性时间内工作的简单自上而下递归解析器解析。这是一个示例实现:
class Parser:
def __init__(self, data):
self.data = data
self.pos = 0
def get_cur_char(self):
"""
Returns the current character or None if the input is over
"""
return None if self.pos == len(self.data) else self.data[self.pos]
def advance(self):
"""
Moves to the next character of the input if the input is not over.
"""
if self.pos < len(self.data):
self.pos += 1
def get_and_advance(self):
"""
Returns the current character and moves to the next one.
"""
res = self.get_cur_char()
self.advance()
return res
def parse_expr(self):
"""
Parse the EXPR according to the speficied grammar.
"""
cur_char = self.get_cur_char()
if cur_char == '(':
# EXPR -> (LIST) EXPR rule
self.advance()
# Parser the list and the rest of the expression and combines
# the result.
prefixes = self.parse_list()
suffices = self.parse_expr()
return [p + s for p in prefixes for s in suffices]
elif not cur_char or cur_char == ')' or cur_char == '|':
# EXPR -> Empty rule. Returns a list with an empty string without
# consuming the input.
return ['']
else:
# EXPR -> A EXPR rule.
# Parses the rest of the expression and prepends the current
# character.
self.advance()
return [cur_char + s for s in self.parse_expr()]
def parse_list(self):
"""
Parser the LIST according to the speficied grammar.
"""
first_expr = self.parse_expr()
# Uses the LIST -> EXPR | LIST rule if the next character is | and
# LIST -> EXPR otherwise
return first_expr + (self.parse_list() if self.get_and_advance() == '|' else [])
if __name__ == '__main__':
string = "(A((B|C)D|E|F))"
parser = Parser(string)
print('\n'.join(parser.parse_expr()))
如果您不熟悉此技术,可以阅读更多相关内容 here。
此实现不是最有效的实现(例如,它显式使用列表而不是迭代器),但它是一个很好的起点。
我建议寻求一种立即针对最终结果的解决方案。所以一个可以进行这种转换的函数:
input: "(A((B|C)D|E|F))"
output: ['ABD', 'ACD', 'AE', 'AF']
这是我建议的代码:
import re
def tokenize(text):
return re.findall(r'[()|]|\w+', text)
def product(a, b):
return [x+y for x in a for y in b] if a and b else a or b
def parse(text):
tokens = tokenize(text)
def recurse(tokens, i):
factor = []
term = []
while i < len(tokens) and tokens[i] != ')':
token = tokens[i]
i += 1
if token == '|':
term.extend(factor)
factor = []
else:
if token == '(':
expr, i = recurse(tokens, i)
else:
expr = [token]
factor = product(factor, expr)
return term+factor, i+1
return recurse(tokens, 0)[0]
string = "(A((B|C)D|E|F))"
print(parse(string))
在 repl.it
上查看 运行
我正在尝试提取嵌套括号内的字符串并生成它们。
假设我有以下字符串
string = "(A((B|C)D|E|F))"
根据
我可以提取嵌套括号内的字符串,但对于我的情况来说它是不同的,因为我在括号的末尾有 "D"
所以这是代码的结果。它看起来与我想要的输出相去甚远
['B|C', 'D|E|F', 'A']
这是我想要的输出
[[['A'],['B|C'],['D']], [['A'],['E|F']']] # '|' means OR
你有什么建议吗,我应该使用正则表达式来实现还是只 运行 遍历所有给定的字符串?
所以可以引出我最后的结果,就是
"ABD"
"ACD"
"AE"
"AF"
在这一点上,我将使用itertools.product
您没有准确指定语言,但似乎允许任意嵌套括号。这不是一种常规语言。我不建议用正则表达式解析它(这可能是因为 python 中的正则表达式不是真正的正则,但即使可能,它也可能是一团糟)。
我建议为您的语言定义一个上下文无关语法并解析它。方法如下:
EXPR -> A EXPR (an expression is an expression preceded by an alphabetic character)
EXPR -> (LIST) EXPR (an expression is a list followed by an expression)
EXPR -> "" (an expression can be an empty string)
LIST -> EXPR | LIST (a list is an expression followed by "|" followed by a list)
LIST -> EXPR (or just one expression)
此文法可以由线性时间内工作的简单自上而下递归解析器解析。这是一个示例实现:
class Parser:
def __init__(self, data):
self.data = data
self.pos = 0
def get_cur_char(self):
"""
Returns the current character or None if the input is over
"""
return None if self.pos == len(self.data) else self.data[self.pos]
def advance(self):
"""
Moves to the next character of the input if the input is not over.
"""
if self.pos < len(self.data):
self.pos += 1
def get_and_advance(self):
"""
Returns the current character and moves to the next one.
"""
res = self.get_cur_char()
self.advance()
return res
def parse_expr(self):
"""
Parse the EXPR according to the speficied grammar.
"""
cur_char = self.get_cur_char()
if cur_char == '(':
# EXPR -> (LIST) EXPR rule
self.advance()
# Parser the list and the rest of the expression and combines
# the result.
prefixes = self.parse_list()
suffices = self.parse_expr()
return [p + s for p in prefixes for s in suffices]
elif not cur_char or cur_char == ')' or cur_char == '|':
# EXPR -> Empty rule. Returns a list with an empty string without
# consuming the input.
return ['']
else:
# EXPR -> A EXPR rule.
# Parses the rest of the expression and prepends the current
# character.
self.advance()
return [cur_char + s for s in self.parse_expr()]
def parse_list(self):
"""
Parser the LIST according to the speficied grammar.
"""
first_expr = self.parse_expr()
# Uses the LIST -> EXPR | LIST rule if the next character is | and
# LIST -> EXPR otherwise
return first_expr + (self.parse_list() if self.get_and_advance() == '|' else [])
if __name__ == '__main__':
string = "(A((B|C)D|E|F))"
parser = Parser(string)
print('\n'.join(parser.parse_expr()))
如果您不熟悉此技术,可以阅读更多相关内容 here。
此实现不是最有效的实现(例如,它显式使用列表而不是迭代器),但它是一个很好的起点。
我建议寻求一种立即针对最终结果的解决方案。所以一个可以进行这种转换的函数:
input: "(A((B|C)D|E|F))"
output: ['ABD', 'ACD', 'AE', 'AF']
这是我建议的代码:
import re
def tokenize(text):
return re.findall(r'[()|]|\w+', text)
def product(a, b):
return [x+y for x in a for y in b] if a and b else a or b
def parse(text):
tokens = tokenize(text)
def recurse(tokens, i):
factor = []
term = []
while i < len(tokens) and tokens[i] != ')':
token = tokens[i]
i += 1
if token == '|':
term.extend(factor)
factor = []
else:
if token == '(':
expr, i = recurse(tokens, i)
else:
expr = [token]
factor = product(factor, expr)
return term+factor, i+1
return recurse(tokens, 0)[0]
string = "(A((B|C)D|E|F))"
print(parse(string))
在 repl.it
上查看 运行