How to turn opening/closing tags into an associative array in PHP?

I have a natural language processing parse tree in the following form:

(S
  (NP I)
  (VP
    (VP (V shot) (NP (Det an) (N elephant)))
    (PP (P in) (NP (Det my) (N pajamas)))))

I want to store it in an associative array, but PHP has no built-in function for this, since NLP is usually done in Python.

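So I need to parse the opening and closing parentheses myself to build a tree-structured associative array. I can think of two options:

  1. Replace the parentheses with arbitrary XML or HTML tags and parse the result as an XML or HTML document.
  2. Use a regular expression.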

I think the first approach is non-standard, and a regex pattern may break in complex cases.

Can you suggest a reliable approach?

The associative array can take any form, since it is not hard to manipulate (I need to use it in a loop), but it could look like

Array (
[0] => word => ROOT, tag => S, children => Array (
    [0] => word => I, tag => NP, children => Array()
    [1] => word => ROOT, tag => VP, children => Array (
        [0] => word => ROOT, tag => VP, children => Array ( .... )
        [1] => word => ROOT, tag => PP, children => Array ( .... )
    )
)
)

or it could be

Array (
[0] => Array([0] => S, [1] => Array (
    [0] => Array([0] => NP, [1] => 'I') // child array is replaced by a string
    [1] => Array([0] => VP, [1] => Array (
        [0] => Array([0] => VP, [1] => Array ( .... ))
        [1] => Array([0] => PP, [1] => Array ( .... ))
    ))
))
)

Use a lexer or parser generator such as flex or bison, or just write your own lexer by hand. This answer has some useful information that you may need.
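A hand-written lexer for this bracketed format can be very small. The sketch below is only an illustration: the function name `tokenize` is made up here, and it assumes that tokens are separated by whitespace or parentheses and that parentheses never appear inside a word.

```php
<?php
// Hypothetical helper: splits a bracketed parse tree into "(", ")" and word tokens.
function tokenize(string $input): array
{
    $tokens = [];
    $current = '';                       // word being collected
    $length = strlen($input);
    for ($i = 0; $i < $length; $i++) {
        $char = $input[$i];
        $isSpace = strpos(" \t\r\n", $char) !== false;
        if ($char === '(' || $char === ')' || $isSpace) {
            // A bracket or whitespace ends the word collected so far.
            if ($current !== '') {
                $tokens[] = $current;
                $current = '';
            }
            if (!$isSpace) {
                $tokens[] = $char;       // brackets are tokens of their own
            }
        } else {
            $current .= $char;
        }
    }
    if ($current !== '') {
        $tokens[] = $current;            // trailing word without a closing bracket
    }
    return $tokens;
}

print_r(tokenize('(NP (Det an) (N elephant))'));
```

Scanning byte by byte like this also works for UTF-8 words, because the bytes of a multi-byte character never equal an ASCII bracket or space.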

Here is a quick but incomplete proof-of-concept snippet written in PHP that outputs the array in the expected form.

$data =<<<EOL
(S
  (NP I)
  (VP
    (VP (V shot) (NP (Det an) (N elephant)))
    (PP (P in) (NP (Det my) (N pajamas)))))
EOL;

$lexer = new Lexer($data);
$array = buildTree($lexer, 0);
print_r($array);

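// Consumes tokens until the matching ")" (or the end of input) and returns
// the node built from them. $level tracks the nesting depth; it is not used yet.
function buildTree($lexer, $level)
{
    $subtrees = []; // nested "(...)" groups found at this depth
    $markers = [];  // plain tokens (tag and/or word) found at this depth
    while (($token = $lexer->nextToken()) !== false) {
        if ($token === '(') {
            // An opening bracket starts a child node: recurse.
            $subtrees[] = buildTree($lexer, $level + 1);
        } elseif ($token === ')') {
            // A closing bracket ends the current node.
            return buildNode($markers, $subtrees);
        } else {
            $markers[] = $token;
        }
    }

    // End of input: return whatever was collected at the top level.
    return buildNode($markers, $subtrees);
}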

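// Shapes one node: [tag, children] for inner nodes, [tag, word] for leaves.
function buildNode($markers, $subtrees)
{
    if (count($markers) && count($subtrees)) {
        // Inner node: a tag followed by child nodes.
        return [$markers[0], $subtrees];
    } elseif (count($subtrees)) {
        // No tag at this depth (the top level): return the children as-is.
        return $subtrees;
    } else {
        // Leaf: just the plain tokens, e.g. ['NP', 'I'].
        return $markers;
    }
}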

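// Minimal lexer: pre-splits the input into "(", ")" and word tokens.
class Lexer
{
    private $data;

    private $matches;

    private $index = -1;

    public function __construct($data)
    {
        $this->data = $data;
        // A token is a bracket or any run of characters that are neither
        // whitespace nor brackets, so punctuation inside words is kept.
        preg_match_all('/\(|\)|[^\s()]+/', $data, $matches);
        $this->matches = $matches[0];
    }

    // Returns the next token, or false once the input is exhausted.
    public function nextToken()
    {
        $index = ++$this->index;
        if (isset($this->matches[$index]) === false) {
            return false;
        }
        return $this->matches[$index];
    }
}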

Output

Array
(
    [0] => Array
        (
            [0] => S
            [1] => Array
                (
                    [0] => Array
                        (
                            [0] => NP
                            [1] => I
                        )

                    [1] => Array
                        (
                            [0] => VP
                            [1] => Array
                                (
                                    [0] => Array
                                        (
                                            [0] => VP
                                            [1] => Array
                                                (
                                                    [0] => Array
                                                        (
                                                            [0] => V
                                                            [1] => shot
                                                        )

                                                    [1] => Array
                                                        (
                                                            [0] => NP
                                                            [1] => Array
                                                                (
                                                                    [0] => Array
                                                                        (
                                                                            [0] => Det
                                                                            [1] => an
                                                                        )

                                                                    [1] => Array
                                                                        (
                                                                            [0] => N
                                                                            [1] => elephant
                                                                        )

                                                                )

                                                        )

                                                )

                                        )

                                    [1] => Array
                                        (
                                            [0] => PP
                                            [1] => Array
                                                (
                                                    [0] => Array
                                                        (
                                                            [0] => P
                                                            [1] => in
                                                        )

                                                    [1] => Array
                                                        (
                                                            [0] => NP
                                                            [1] => Array
                                                                (
                                                                    [0] => Array
                                                                        (
                                                                            [0] => Det
                                                                            [1] => my
                                                                        )

                                                                    [1] => Array
                                                                        (
                                                                            [0] => N
                                                                            [1] => pajamas
                                                                        )

                                                                )

                                                        )

                                                )

                                        )

                                )

                        )

                )

        )

)
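
If the first shape from the question (tag / word / children) is preferred, the same idea can be written as a small recursive-descent parser. This is a sketch under assumptions, not a drop-in replacement: the names `parseTree` and `parseNode` are invented for illustration, `word` is `null` for inner nodes instead of `ROOT`, and input with unbalanced brackets throws an exception.

```php
<?php
// Hypothetical entry point: parses "(TAG word)" or "(TAG (...) (...))" into nested nodes.
function parseTree(string $input): array
{
    // Tokens are brackets or runs of non-whitespace, non-bracket characters.
    preg_match_all('/\(|\)|[^\s()]+/', $input, $matches);
    $tokens = $matches[0];
    $position = 0;
    $node = parseNode($tokens, $position);
    if ($position !== count($tokens)) {
        throw new InvalidArgumentException('Unexpected tokens after the root node');
    }
    return $node;
}

// Parses one bracketed node starting at $position and advances $position past it.
function parseNode(array $tokens, int &$position): array
{
    if (($tokens[$position] ?? null) !== '(') {
        throw new InvalidArgumentException("Expected '(' at token $position");
    }
    $position++; // consume "("
    $tag = $tokens[$position] ?? null;
    if ($tag === null || $tag === '(' || $tag === ')') {
        throw new InvalidArgumentException("Expected a tag at token $position");
    }
    $position++; // consume the tag
    $node = ['tag' => $tag, 'word' => null, 'children' => []];
    while (true) {
        $token = $tokens[$position] ?? null;
        if ($token === null) {
            throw new InvalidArgumentException("Missing ')' for tag $tag");
        }
        if ($token === ')') {
            $position++; // consume ")"
            return $node;
        }
        if ($token === '(') {
            $node['children'][] = parseNode($tokens, $position);
        } else {
            // Plain token: the word of a leaf (multiple words are joined by a space).
            $node['word'] = $node['word'] === null ? $token : $node['word'] . ' ' . $token;
            $position++;
        }
    }
}

$tree = parseTree('(S (NP I) (VP (V shot) (NP (Det an) (N elephant))))');
echo $tree['tag'], "\n";                 // S
echo $tree['children'][0]['word'], "\n"; // I
```

Each node has the same three keys, so a loop or a recursive function can walk `children` without checking whether an entry is a string or an array.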