如何将opening/closing标签变成PHP中的关联数组?
How to turn opening/closing tags into an associative array in PHP?
我有一个自然语言处理解析树作为
(S
(NP I)
(VP
(VP (V shot) (NP (Det an) (N elephant)))
(PP (P in) (NP (Det my) (N pajamas)))))
我想把它存储在一个关联数组中,但是在PHP中没有函数,因为NLP通常在python中完成。
因此,我应该解析左括号和右括号来构建一个树结构的关联数组。我可以想到两个选择
- 用任意 XML 或 HTML 标签替换括号,并将其解析为 XML 或 HTML 文档。
- 使用正则表达式。
我认为第一种方法不标准,正则表达式模式在复杂情况下可能会中断。
你能推荐一个可靠的方法吗?
关联数组可以有任何形式,因为它不难操作(我需要它在一个循环中),但它可以像
Array (
[0] = > word => ROOT, tag => S, children => Array (
[0] word => I, tag = > NP, children => Array()
[1] word => ROOT, tag => VP, children => Array (
[0] => word => ROOT, tag => VP, children => Array ( .... )
[1] => word => ROOT, tag => PP, children => Array ( .... )
)
)
)
也可以是
Array (
[0] = > Array([0] => S, [1] => Array (
[0] Array([0] => NP, [1] => 'I') // child array is replaced by a string
[1] Array([0] => VP, [1] => Array (
[0] => Array([0] => VP, [1] => Array ( .... )
[1] => Array([0] => PP, [1] => Array ( .... )
)
)
喜欢使用词法分析器生成器bison or flex or just write your own lexer by hands, This answer 有一些您需要的有用信息。
这是一个用 PHP 编写的快速但不完整的 POC 片段,它将按预期输出关联数组。
$data =<<<EOL
(S
(NP I)
(VP
(VP (V shot) (NP (Det an) (N elephant)))
(PP (P in) (NP (Det my) (N pajamas)))))
EOL;
$lexer = new Lexer($data);
$array = buildTree($lexer, 0);
print_r($array);
function buildTree($lexer, $level)
{
$subtrees = [];
$markers = [];
while (($token = $lexer->nextToken()) !== false) {
if ($token == '(') {
$subtrees[] = buildTree($lexer, $level);
} elseif ($token == ')') {
return buildNode($markers, $subtrees);
} else {
$markers[] = $token;
}
}
return buildNode($markers, $subtrees);
}
function buildNode($markers, $subtrees)
{
if (count($markers) && count($subtrees)) {
return [$markers[0], $subtrees];
} elseif (count($subtrees)) {
return $subtrees;
} else {
return $markers;
}
}
class Lexer
{
private $data;
private $matches;
private $index = -1;
public function __construct($data)
{
$this->data = $data;
preg_match_all('/[\w]+|\(|\)/', $data, $matches);
$this->matches = $matches[0];
}
public function nextToken()
{
$index = ++$this->index;
if (isset($this->matches[$index]) === false) {
return false;
}
return $this->matches[$index];
}
}
输出
Array
(
[0] => Array
(
[0] => S
[1] => Array
(
[0] => Array
(
[0] => NP
[1] => I
)
[1] => Array
(
[0] => VP
[1] => Array
(
[0] => Array
(
[0] => VP
[1] => Array
(
[0] => Array
(
[0] => V
[1] => shot
)
[1] => Array
(
[0] => NP
[1] => Array
(
[0] => Array
(
[0] => Det
[1] => an
)
[1] => Array
(
[0] => N
[1] => elephant
)
)
)
)
)
[1] => Array
(
[0] => PP
[1] => Array
(
[0] => Array
(
[0] => P
[1] => in
)
[1] => Array
(
[0] => NP
[1] => Array
(
[0] => Array
(
[0] => Det
[1] => my
)
[1] => Array
(
[0] => N
[1] => pajamas
)
)
)
)
)
)
)
)
)
)
我有一个自然语言处理解析树作为
(S
(NP I)
(VP
(VP (V shot) (NP (Det an) (N elephant)))
(PP (P in) (NP (Det my) (N pajamas)))))
我想把它存储在一个关联数组中,但是在PHP中没有函数,因为NLP通常在python中完成。
因此,我应该解析左括号和右括号来构建一个树结构的关联数组。我可以想到两个选择
- 用任意 XML 或 HTML 标签替换括号,并将其解析为 XML 或 HTML 文档。
- 使用正则表达式。
我认为第一种方法不标准,正则表达式模式在复杂情况下可能会中断。
你能推荐一个可靠的方法吗?
关联数组可以有任何形式,因为它不难操作(我需要它在一个循环中),但它可以像
Array (
[0] = > word => ROOT, tag => S, children => Array (
[0] word => I, tag = > NP, children => Array()
[1] word => ROOT, tag => VP, children => Array (
[0] => word => ROOT, tag => VP, children => Array ( .... )
[1] => word => ROOT, tag => PP, children => Array ( .... )
)
)
)
也可以是
Array (
[0] = > Array([0] => S, [1] => Array (
[0] Array([0] => NP, [1] => 'I') // child array is replaced by a string
[1] Array([0] => VP, [1] => Array (
[0] => Array([0] => VP, [1] => Array ( .... )
[1] => Array([0] => PP, [1] => Array ( .... )
)
)
喜欢使用词法分析器生成器bison or flex or just write your own lexer by hands, This answer 有一些您需要的有用信息。
这是一个用 PHP 编写的快速但不完整的 POC 片段,它将按预期输出关联数组。
$data =<<<EOL
(S
(NP I)
(VP
(VP (V shot) (NP (Det an) (N elephant)))
(PP (P in) (NP (Det my) (N pajamas)))))
EOL;
$lexer = new Lexer($data);
$array = buildTree($lexer, 0);
print_r($array);
function buildTree($lexer, $level)
{
$subtrees = [];
$markers = [];
while (($token = $lexer->nextToken()) !== false) {
if ($token == '(') {
$subtrees[] = buildTree($lexer, $level);
} elseif ($token == ')') {
return buildNode($markers, $subtrees);
} else {
$markers[] = $token;
}
}
return buildNode($markers, $subtrees);
}
function buildNode($markers, $subtrees)
{
if (count($markers) && count($subtrees)) {
return [$markers[0], $subtrees];
} elseif (count($subtrees)) {
return $subtrees;
} else {
return $markers;
}
}
class Lexer
{
private $data;
private $matches;
private $index = -1;
public function __construct($data)
{
$this->data = $data;
preg_match_all('/[\w]+|\(|\)/', $data, $matches);
$this->matches = $matches[0];
}
public function nextToken()
{
$index = ++$this->index;
if (isset($this->matches[$index]) === false) {
return false;
}
return $this->matches[$index];
}
}
输出
Array
(
[0] => Array
(
[0] => S
[1] => Array
(
[0] => Array
(
[0] => NP
[1] => I
)
[1] => Array
(
[0] => VP
[1] => Array
(
[0] => Array
(
[0] => VP
[1] => Array
(
[0] => Array
(
[0] => V
[1] => shot
)
[1] => Array
(
[0] => NP
[1] => Array
(
[0] => Array
(
[0] => Det
[1] => an
)
[1] => Array
(
[0] => N
[1] => elephant
)
)
)
)
)
[1] => Array
(
[0] => PP
[1] => Array
(
[0] => Array
(
[0] => P
[1] => in
)
[1] => Array
(
[0] => NP
[1] => Array
(
[0] => Array
(
[0] => Det
[1] => my
)
[1] => Array
(
[0] => N
[1] => pajamas
)
)
)
)
)
)
)
)
)
)