提取嵌套括号内的字符串及其 pre/postfix Python

Question

我正在尝试提取嵌套括号内的字符串并生成它们。

假设我有以下字符串

string = "(A((B|C)D|E|F))"

根据

中的回答

我可以提取嵌套括号内的字符串，但对于我的情况来说它是不同的，因为我在括号的末尾有 "D" 所以这是代码的结果。它看起来与我想要的输出相去甚远

['B|C', 'D|E|F', 'A']

这是我想要的输出

[[['A'],['B|C'],['D']], [['A'],['E|F']']]     # '|' means OR

你有什么建议吗，我应该使用正则表达式来实现还是只运行遍历所有给定的字符串？

所以可以引出我最后的结果，就是

"ABD"
"ACD"
"AE"
"AF"

在这一点上，我将使用itertools.product

Answer 1

您没有准确指定语言，但似乎允许任意嵌套括号。这不是一种常规语言。我不建议用正则表达式解析它（这可能是因为 python 中的正则表达式不是真正的正则，但即使可能，它也可能是一团糟）。

我建议为您的语言定义一个上下文无关语法并解析它。方法如下：

EXPR -> A EXPR (an expression is an expression preceded by an alphabetic character)
EXPR -> (LIST) EXPR (an expression is a list followed by an expression)
EXPR -> "" (an expression can be an empty string)

LIST -> EXPR | LIST (a list is an expression followed by "|" followed by a list)
LIST -> EXPR (or just one expression)

此文法可以由线性时间内工作的简单自上而下递归解析器解析。这是一个示例实现：

class Parser:

    def __init__(self, data):
        self.data = data
        self.pos = 0

    def get_cur_char(self):
        """
        Returns the current character or None if the input is over
        """
        return None if self.pos == len(self.data) else self.data[self.pos]

    def advance(self):
        """
        Moves to the next character of the input if the input is not over.
        """
        if self.pos < len(self.data):
            self.pos += 1

    def get_and_advance(self):
        """
        Returns the current character and moves to the next one.
        """
        res = self.get_cur_char()
        self.advance()
        return res

    def parse_expr(self):
        """
        Parse the EXPR according to the speficied grammar.
        """
        cur_char = self.get_cur_char()
        if cur_char == '(':
            # EXPR -> (LIST) EXPR rule
            self.advance()
            # Parser the list and the rest of the expression and combines
            # the result.
            prefixes = self.parse_list()
            suffices = self.parse_expr()
            return [p + s for p in prefixes for s in suffices]
        elif not cur_char or cur_char == ')' or cur_char == '|':
            # EXPR -> Empty rule. Returns a list with an empty string without
            # consuming the input.
            return ['']
        else:
            # EXPR -> A EXPR rule.
            # Parses the rest of the expression and prepends the current 
            # character.
            self.advance()
            return [cur_char + s for s in self.parse_expr()]

    def parse_list(self):
        """
        Parser the LIST according to the speficied grammar.
        """
        first_expr = self.parse_expr()
        # Uses the LIST -> EXPR | LIST rule if the next character is | and
        # LIST -> EXPR otherwise    
        return first_expr + (self.parse_list() if self.get_and_advance() == '|' else [])


if __name__ == '__main__':
    string = "(A((B|C)D|E|F))"
    parser = Parser(string)
    print('\n'.join(parser.parse_expr()))

如果您不熟悉此技术，可以阅读更多相关内容 here。

此实现不是最有效的实现（例如，它显式使用列表而不是迭代器），但它是一个很好的起点。

Answer 2

我建议寻求一种立即针对最终结果的解决方案。所以一个可以进行这种转换的函数：

input: "(A((B|C)D|E|F))"
output: ['ABD', 'ACD', 'AE', 'AF']

这是我建议的代码：

import re

def tokenize(text):
    return re.findall(r'[()|]|\w+', text)

def product(a, b):
    return [x+y for x in a for y in b] if a and b else a or b

def parse(text):
    tokens = tokenize(text)

    def recurse(tokens, i):
        factor = []
        term = []
        while i < len(tokens) and tokens[i] != ')':
            token = tokens[i]
            i += 1
            if token == '|':
                term.extend(factor)
                factor = []
            else:
                if token == '(':
                    expr, i = recurse(tokens, i)
                else:
                    expr = [token]
                factor = product(factor, expr)
        return term+factor, i+1

    return recurse(tokens, 0)[0]

string = "(A((B|C)D|E|F))"

print(parse(string))

在 repl.it

上查看运行

提取嵌套括号内的字符串及其 pre/postfix Python

Extract sting inside nested brackets and its pre/postfix Python

python

algorithm

brackets