How to get specific elements and sub-elements in lxml recursively?
I have this XML file, which looks like this (of course this is only a small part of the file), with article IDs:
<article-set xmlns:ns0="http://casfwcewf.xsd" format-version="5">
<article>
<article id="11234">
<source>
<hostname>some hostname for 11234</hostname>
</source>
<feed>
<type>RSS</type>
</feed>
<uri>some uri for 11234</uri>
</article>
<article id="63563">
<source>
<hostname>some hostname for 63563 </hostname>
</source>
<feed>
<type>RSS</type>
</feed>
<uri>some uri for 63563</uri>
</article>
.
.
.
</article></article-set>
What I want is to print, for the whole document, every article ID together with its own hostname and uri (like this):
id=11234
uri= some uri for 11234
source=some hostname for 11234
id=63563
uri= some uri for 63563
source=some hostname for 63563
.
.
.
I used this code to do that:
from lxml import etree
tree = etree.parse(r"C:\Users\me\Desktop\public.xml")
for article in tree.iter('article'):
    article_id = article.attrib.get('id')
    uri = tree.xpath("//article[@id]/uri/text()")
    source = tree.xpath("//article[@id]/source/hostname/text()")
    # I even tried these two lines
    # source = article.attrib.get('hostname')
    # source = etree.SubElement(article, "hostname")
    print('id={!s}'.format(article_id), "\n")
    print('uri={!s}'.format(uri), "\n")
    print('source={!s}'.format(source), "\n")
It didn't work. Can someone help me with this?
There is quite likely a smarter way of writing this; however, this does seem to work.
>>> for article in tree.iter('article'):
...     article_id = article.attrib.get('id')
...     uri = tree.xpath("//article[@id={}]/uri/text()".format(article_id))
...     source = tree.xpath("//article[@id={}]/source/hostname/text()".format(article_id))
...     article_id, uri, source
...
('11234', ['some uri for 11234'], ['some hostname for 11234'])
('63563', ['some uri for 63563'], ['some hostname for 63563 '])
By the way, I changed the XML so that the element inside the container element is <articles> (rather than <article>). Like this:
<article-set xmlns:ns0="http://casfwcewf.xsd" format-version="5">
<articles>
<article id="11234">
<source>
...
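
One possible "smarter" variant, offered only as a sketch: instead of searching the whole tree again with an absolute `//article[...]` XPath for each article, look up the children relative to the current `article` element with `get()` and `findtext()`. The inline `XML` sample and the helper name `article_records` are assumptions made for illustration, not part of the original code, and the trailing whitespace in the hostname is stripped here as a design choice.

```python
from lxml import etree

# Sample document mirroring the question, using the <articles> container.
XML = b"""<article-set xmlns:ns0="http://casfwcewf.xsd" format-version="5">
<articles>
<article id="11234">
<source>
<hostname>some hostname for 11234</hostname>
</source>
<feed>
<type>RSS</type>
</feed>
<uri>some uri for 11234</uri>
</article>
<article id="63563">
<source>
<hostname>some hostname for 63563 </hostname>
</source>
<feed>
<type>RSS</type>
</feed>
<uri>some uri for 63563</uri>
</article>
</articles>
</article-set>"""


def article_records(root):
    """Return (id, uri, hostname) for every <article> that has an id attribute."""
    records = []
    for article in root.iter('article'):
        article_id = article.get('id')
        if article_id is None:
            # Skip a wrapper element that is also named <article> but has no id.
            continue
        # findtext() resolves the path relative to this article only.
        uri = article.findtext('uri', default='').strip()
        hostname = article.findtext('source/hostname', default='').strip()
        records.append((article_id, uri, hostname))
    return records


root = etree.fromstring(XML)
for article_id, uri, hostname in article_records(root):
    print('id={}'.format(article_id))
    print('uri={}'.format(uri))
    print('source={}'.format(hostname))
```

Because elements without an `id` are skipped, this should also cope with the original layout where the wrapper element is itself named `<article>`. For a file on disk, `etree.parse(path).getroot()` could be passed in place of `root`.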