How to find all guide IDs and pages with IMG tags in XML export with lxml/xpath?

Question

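How do I parse the XML below so that, for each GUIDE, I find its ID and URL, and then, for each page in the GUIDE, find the page ID and any images appearing in BOXES / BOX / ASSETS / DESCRIPTION? The images are in HTML, so I need to grab the source from each image.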

  <guide>
    <id></id>
   <url></url>
  <group>
   <id></id> 
<type></type>
<name></name>
   </group>
   <pages>
    <page>
 <id></id>
 <name></name>
 <description></description>
 <boxes>
  <box>
   <id></id>
   <name></name>
   <type></type>
   <map_id></map_id>
   <column></column>
   <position></position>
   <hidden></hidden>
   <created></created>
   <updated></updated>
   <assets>
    <asset>
     <id></id>
     <name></name>
     <type></type>
     <description></description>
     <url/>
     <owner>
      <id></id>
      <email></email>
      <first_name></first_name>
      <last_name></last_name>
     </owner>
    </asset>
      </assets>
     </box>
    </boxes>
   </page>
   </pages>
    </guide>

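This gives me the pages with their ID and description, but it is the description inside the asset elements that I need to access, together with the guide/page they are on.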

from lxml import etree
tree = etree.parse('temp.xml')
for page in tree.xpath('.//page'):
    page.xpath('id')[0].text, page.xpath('description')[0].text

Answer 1

The pattern of the code would probably be similar to this, but I can't check it because I don't have your complete XML.

>>> from lxml import etree
>>> tree = etree.parse('temp.xml')
>>> for guide in tree.xpath('guide'):
...     '---', guide.xpath('id')[0].text
...     for pages in guide.xpath('.//pages'):
...         for page in pages:
...             '------', page.xpath('id')[0].text
...             for description in page.xpath('.//asset/description'):
...                 '---------', description.text
... 
('---', 'guide 1')
('------', 'page 1')
('---------', 'description')

I assumed that your XML would have multiple guide elements. This is what I parsed.

<guides>
    <guide>
        <id>guide 1</id>
        <url></url>
        <group>
        <id></id> 
        <type></type>
        <name></name>
        </group>
        <pages>
            <page>
                <id>page 1</id>
                <name></name>
                <description></description>
                <boxes>
                    <box>
                        <id></id>
                        <name></name>
                        <type></type>
                        <map_id></map_id>
                        <column></column>
                        <position></position>
                        <hidden></hidden>
                        <created></created>
                        <updated></updated>
                        <assets>
                            <asset>
                                <id></id>
                                <name></name>
                                <type></type>
                                <description>description</description>
                                <url/>
                                <owner>
                                    <id></id>
                                    <email></email>
                                    <first_name></first_name>
                                    <last_name></last_name>
                                </owner>
                            </asset>
                        </assets>
                    </box>
                </boxes>
            </page>
        </pages>
    </guide>
</guides>

I made my life easier by indenting the XML so that I could make out its structure.
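The loop above stops at the asset description text and does not pull the image sources out of it. Here is a minimal sketch of one way to go the rest of the way. It assumes the HTML is stored inside each asset's description as escaped text (or CDATA), so that `description.text` is an HTML string. The sample data, the example URL, and the helper names `image_sources` and `guides_with_images` are made up for illustration, and I haven't tried it on the real export.

```python
from lxml import etree, html

# Made-up sample in the shape of the export; the HTML inside
# <description> is assumed to be stored as escaped text.
SAMPLE = b"""<guides>
  <guide>
    <id>guide 1</id>
    <url>http://example.com/guide-1</url>
    <pages>
      <page>
        <id>page 1</id>
        <boxes><box><assets>
          <asset>
            <description>&lt;p&gt;Intro &lt;img src="a.png"/&gt;&lt;/p&gt;</description>
          </asset>
          <asset>
            <description>plain text, no image</description>
          </asset>
        </assets></box></boxes>
      </page>
    </pages>
  </guide>
</guides>"""


def image_sources(fragment):
    """Return the src of every <img> in an HTML string (hypothetical helper)."""
    if not fragment or not fragment.strip():
        return []  # html.fromstring() raises on empty input
    parsed = html.fromstring(fragment)
    # descendant-or-self also matches when the fragment is a bare <img>
    return [str(src) for src in parsed.xpath('descendant-or-self::img/@src')]


def guides_with_images(root):
    """Collect (guide id, guide url, page id, img src) for every image found."""
    rows = []
    for guide in root.xpath('//guide'):
        guide_id = guide.findtext('id')
        guide_url = guide.findtext('url')
        for page in guide.xpath('pages/page'):
            page_id = page.findtext('id')
            for description in page.xpath('boxes/box/assets/asset/description'):
                for src in image_sources(description.text):
                    rows.append((guide_id, guide_url, page_id, src))
    return rows


root = etree.fromstring(SAMPLE)
rows = guides_with_images(root)
for row in rows:
    print(row)
```

Pages without any image simply produce no rows, so the result only lists guides and pages that actually contain an IMG tag. For the real file, `etree.parse('temp.xml').getroot()` would take the place of `etree.fromstring(SAMPLE)`.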

python

xpath

lxml