In Perl, extract text from related nodes, using XML::Twig
Here is the XML file I want to parse:
<?xml version="1.0" encoding="UTF-8"?>
<topic id="yerus5" xmlns:ditaarch="http://dita.oasis-open.org/architecture/2005/">
<title/>
<shortdesc/>
<body>
<p><b>CCU_CNT_ADDR: (Address=0x004 Reset=32'h1)</b><table id="table_r5b_1xj_ts">
<tgroup cols="4">
<colspec colnum="1" colname="col1"/>
<colspec colnum="2" colname="col2"/>
<colspec colnum="3" colname="col3"/>
<colspec colnum="4" colname="col4"/>
<tbody>
<row>
<entry>Field</entry>
<entry>OFFSET</entry>
<entry>R/W Access</entry>
<entry>Description</entry>
</row>
<row>
<entry>reg2sm_cnt</entry>
<entry>15:0</entry>
<entry>R/W</entry>
<entry>Count Value to increment in the extenral memory at the specified location.
Default Value of 1. A Count value of 0 will clear the counter value</entry>
</row>
<row>
<entry>ccu2bus_endianess</entry>
<entry>24</entry>
<entry>R/W</entry>
<entry>Endianess of the data structure bit</entry>
</row></tbody>
</tgroup>
</table><b>CCU_STAT_ADDR: (Address=0x008 Reset=32'h0)</b><table id="table_mcc_1xj_ts">
<tgroup cols="4">
<colspec colnum="1" colname="col1"/>
<colspec colnum="2" colname="col2"/>
<colspec colnum="3" colname="col3"/>
<colspec colnum="4" colname="col4"/>
<tbody>
<row>
<entry>Field</entry>
<entry>OFFSET</entry>
<entry>R/W Access</entry>
<entry>Description</entry>
</row>
<row>
<entry>fifo_cnt</entry>
<entry>1:0</entry>
<entry>R</entry>
<entry>Status. 0x0 indicates that the engine is free. Will be 0x1 on a write to
address</entry>
</row>
<row>
<entry>rfifo_cnt</entry>
<entry>3:2</entry>
<entry>R</entry>
<entry>Status. 0x0 indicates there are no pending read values from CCU engine.</entry>
</row> </tbody>
</tgroup>
</table></p>
</body>
</topic>
After running the following code:
use strict;
use warnings;
use XML::Twig;
use Data::Dumper;

my @headers;
my $column_to_show = 'Field';

sub process_row {
    my %entries;
    my ( $twig, $row ) = @_;
    my @row_entries = map { $_->text } $row->children;
    if (@headers) {
        @entries{@headers} = @row_entries;
        print $column_to_show, " => ", $entries{$column_to_show}, "\n";
    }
    else {
        @headers = @row_entries;
    }
}

my $twig = XML::Twig->new(
    'pretty_print' => 'indented_a',
    twig_handlers  => { 'row' => \&process_row }
)->parsefile('your_file.xml');
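I am able to access the data in every <entry></entry>.
I am not able to extract the details that go with each <b></b> text. Yes, I can extract all the <b></b> text, but I cannot extract the <row></row> elements for each <b></b> separately. Here is the sample output: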
Name: CCU_CNT_ADDR: (Address=0x004 Reset=32'h1)
Field: reg2sm_cnt
OFFSET: 15:0
Access: R/W
Description: Count Value to increment in the extenral memory at the specified location. Default Value of 1. A Count value of 0 will clear the counter value
Field: ccu2bus_endianess
OFFSET: 24
Access: R/W
Description: Endianess of the data structure bit
.
.
.
.
.
.
.
Name: CCU_STAT_ADDR: (Address=0x008 Reset=32'h0)
Field: fifo_cnt
.
.
.
.
.
.
.
I tried the following, but it doesn't work:
foreach my $b ( $twig->get_xpath("//b") )    # Extract text of <b></b>
{
    print $b->text, "\n";
    foreach my $row ( $twig->get_xpath("//row") )
    {
        print $row->text, "\n";
    }
}
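Well, given your example, it's actually a bit annoying, because the XML doesn't explicitly associate the 'heading' with the 'table' (for example, by wrapping them in a common XML node).
However, you can use the prev_sibling method to get the previous element at the same level.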
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new->parsefile('your_file.xml');

foreach my $table ( $twig->get_xpath('//table') ) {
    # The heading is assumed to be the element directly before the table.
    my $header = $table->prev_sibling->text;
    print "Name: $header\n";

    my @headers;    # column names, taken from the first row of this table
    foreach my $row ( $table->get_xpath("tgroup/tbody/row") ) {
        my %entries;
        # Strip each line break plus the indentation that follows it (needs perl 5.14+ for /r).
        my @row_entries = map { $_->text =~ s/\n\s+//rg; } $row->children;
        if (@headers) {
            @entries{@headers} = @row_entries;
            foreach my $field (@headers) {
                print "$field: $entries{$field}\n";
            }
        }
        else {
            @headers = @row_entries;
        }
    }
    print "----\n";
}
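Note: this assumes that the 'element before the table' is the header. That holds for your specific case, but it will only work if the heading is always the element directly before the <table> you want to display.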
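- We run a 'foreach' loop that picks out the elements named table (there are two in your sample).
- For each table, we assume the previous sibling is the header. In this case, that's your <b> element. Be careful though, because <b> means bold in HTML and is a formatting tag.
- Then we do basically the same thing as the other code: for each table, break down the rows so that we have a header and a bunch of columns, then print each data row.
- As part of doing this, I use a regular expression to strip 'linefeed and whitespace' (s/\n\s+//gr), because the formatting in the description looked a bit 'off'. Obviously you can remove that if you don't need it. (Note: the /r flag only works on newer perl versions, 5.14 or later.)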
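The caveat about the previous sibling can be made explicit in code. The following is a minimal sketch, not part of the answer above: heading_for is a hypothetical helper name, and the inline XML is a made-up two-table sample in which the second table has a <p> in front of it instead of a <b>.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: return the text of the <b> directly before a table,
# or a placeholder when the previous sibling is missing or is not a <b>.
sub heading_for {
    my ($table) = @_;
    my $prev = $table->prev_sibling;    # undef when the table is the first child
    return ( $prev && $prev->tag eq 'b' ) ? $prev->text : '(no heading)';
}

# Made-up sample: the first table has a <b> heading, the second does not.
my $xml = <<'XML';
<body>
  <b>REG_A</b><table id="t1"/>
  <p>unrelated</p><table id="t2"/>
</body>
XML

my $twig = XML::Twig->new;
$twig->parse($xml);

foreach my $table ( $twig->get_xpath('//table') ) {
    print $table->att('id'), ": ", heading_for($table), "\n";
}
```

Checking the tag name keeps a table without a heading from silently picking up the text of some unrelated element, at the cost of one extra comparison per table.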
This produces:
Name: CCU_CNT_ADDR: (Address=0x004 Reset=32'h1)
Field: reg2sm_cnt
OFFSET: 15:0
R/W Access: R/W
Description: Count Value to increment in the extenral memory at the specified location.Default Value of 1. A Count value of 0 will clear the counter value
Field: ccu2bus_endianess
OFFSET: 24
R/W Access: R/W
Description: Endianess of the data structure bit
----
Name: CCU_STAT_ADDR: (Address=0x008 Reset=32'h0)
Field: fifo_cnt
OFFSET: 1:0
R/W Access: R
Description: Status. 0x0 indicates that the engine is free. Will be 0x1 on a write toaddress
Field: rfifo_cnt
OFFSET: 3:2
R/W Access: R
Description: Status. 0x0 indicates there are no pending read values from CCU engine.
----
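In the output above, "location.Default" and "toaddress" run together because s/\n\s+//gr deletes the line break and the indentation without putting anything in their place. If that is not wanted, one option is to collapse every run of whitespace into a single space instead. This is a small sketch in plain Perl; normalize_ws is a hypothetical helper name, and the amount of indentation in the sample string is an assumption.

```perl
use v5.14;    # the /r (non-destructive substitution) flag needs perl 5.14 or later
use warnings;

# Hypothetical helper: collapse each run of whitespace, including line
# breaks, into one space, and trim both ends.
sub normalize_ws {
    my ($text) = @_;
    return $text =~ s/\s+/ /gr =~ s/^ | $//gr;
}

# Text shaped like one of the Description entries: a line break followed
# by indentation in the middle of a sentence (indentation width assumed).
my $text = "Will be 0x1 on a write to\n    address";

my $glued = $text =~ s/\n\s+//gr;    # pattern from the answer: words get glued together

print "$glued\n";
print normalize_ws($text), "\n";
```

In the answer's loop this would mean replacing the map body with normalize_ws( $_->text ).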