How to parse a complicated XML with Python
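I am struggling to convert XML files to CSV or a pandas DataFrame. The XML contains various categories, some needed and some not. Is there an efficient way to extract information from documents in the following format? This has to be done at a relatively large scale, for more than 10,000 documents. For example, I would like to get "family-id", "data" and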
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE patent-document PUBLIC "-//MXW//DTD patent-document XML//EN"
"http://www.ir-facility.org/dtds/patents/v1.4/patent-document.dtd">
<patent-document ucid="US-20030137706-A1" country="US"
doc-number="20030137706" kind="A1" lang="EN" family-id="10973265" status="new"
date-produced="20090605" date="20030724">
<bibliographic-data>
<publication-reference ucid="US-20030137706-A1" status="new"
fvid="76030147">
<document-id status="new" format="original">
<country>US</country>
<doc-number>20030137706</doc-number>
<kind>A1</kind>
<date>20030724</date>
</document-id>
</publication-reference>
<application-reference ucid="US-18203002-A" status="new" is-representative="NO">
<document-id status="new" format="epo">
<country>US</country>
<doc-number>18203002</doc-number>
<kind>A</kind>
<date>20021204</date>
</document-id>
</application-reference>
<priority-claims status="new">
<priority-claim ucid="HU-0000532-A" status="new">
<document-id status="new" format="epo">
<country>HU</country>
<doc-number>0000532</doc-number>
<kind>A</kind>
<date>20000207</date>
</document-id>
</priority-claim>
<priority-claim ucid="HU-0100016-W" status="new">
</abstract>
<description load-source="us" status="new" lang="EN">
<heading>TECHNICAL FIELD </heading>
<p>[0001] The object of the invention is a method for the holographic
recording of data. In the method a hologram containing the date is
recorded in a waveguide layer as an interference between an object beam
and a reference beam. The object beam is essentially perpendicular to
the plane of the hologram, while the reference beam is coupled in the
waveguide. There is also proposed an apparatus for performing the
method. The apparatus comprises a data storage medium with a waveguide
holographic storage layer, and an optical system for writing and reading
the holograms. The optical system comprises means for producing an
object beam and a reference beam, and imaging the object beam and a
reference beam on the storage medium. </p>
<heading>BACKGROUND ART </heading>
<p>[0002] Storage systems realised with tapes stand out from other data
storage systems regarding their immense storage capacity. Such systems
were used to realise the storage of data in the order of Terabytes.
This large storage capacity is achieved partly by the storage density,
and partly by the length of the storage tapes. The relative space
requirements of tapes are small, because they may be wound up into a
very small volume. Their disadvantage is the relatively large random
access time. </p>
I strongly recommend the excellent lxml.etree library! It is very fast because it is a wrapper around the C libraries libxml2 and libxslt.
Usage example:
import lxml.etree
text = '''\
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE patent-document\n PUBLIC "-//MXW//DTD patent-document XML//EN"
"http://www.ir-facility.org/dtds/patents/v1.4/patent-document.dtd">
<patent-document ucid="US-20030137706-A1" country="US" doc-number="20030137706" kind="A1" lang="EN" family-id="10973265" status="new"
date-produced="20090605" date="20030724">
<bibliographic-data>
<publication-reference ucid="US-20030137706-A1" status="new"
fvid="76030147">
<document-id status="new" format="original">
<country>US</country>
<doc-number>20030137706</doc-number>
<kind>A1</kind>
<date>20030724</date>
</document-id>
</publication-reference>
<application-reference ucid="US-18203002-A" status="new" is-representative="NO">
<document-id status="new" format="epo">
<country>US</country>
<doc-number>18203002</doc-number>
<kind>A</kind>
<date>20021204</date>
</document-id>
</application-reference>
<priority-claims status="new">
<priority-claim ucid="HU-0000532-A" status="new">
<document-id status="new" format="epo">
<country>HU</country>
<doc-number>0000532</doc-number>
<kind>A</kind>
<date>20000207</date>
</document-id>
</priority-claim>
<description load-source="us" status="new" lang="EN">
<heading>TECHNICAL FIELD </heading>
<p>[0001] The object of the invention is a method for the holographic
recording of data. In the method a hologram containing the date is
recorded in a waveguide layer as an interference between an object beam
and a reference beam. The object beam is essentially perpendicular to
the plane of the hologram, while the reference beam is coupled in the
waveguide. There is also proposed an apparatus for performing the
method. The apparatus comprises a data storage medium with a waveguide
holographic storage layer, and an optical system for writing and reading
the holograms. The optical system comprises means for producing an
object beam and a reference beam, and imaging the object beam and a
reference beam on the storage medium. </p>
<heading>BACKGROUND ART </heading>
<p>[0002] Storage systems realised with tapes stand out from other data
storage systems regarding their immense storage capacity. Such systems
were used to realise the storage of data in the order of Terabytes.
This large storage capacity is achieved partly by the storage density,
and partly by the length of the storage tapes. The relative space
requirements of tapes are small, because they may be wound up into a
very small volume. Their disadvantage is the relatively large random
access time. </p>
</description>
</priority-claims>
</bibliographic-data>
</patent-document>
'''.encode('utf-8')  # lxml rejects a str that carries an encoding declaration, so pass bytes
# ^^ not needed when reading from a file: lxml.etree.parse(path) handles the decoding
doc = lxml.etree.fromstring(text)
Test:
>>> print(doc.xpath('//patent-document/@family-id'))
['10973265']
>>> print(doc.xpath('//patent-document/@date'))
['20030724']
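To get from single lookups to the CSV the question asks for, each document can be reduced to one flat row. The following is a minimal sketch under some assumptions: it uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra installs (the `get` and `findtext` calls it relies on also exist on lxml.etree elements), the choice of columns is hypothetical, and `extract_row`, `write_csv` and `sample` are illustrative names that do not come from the question.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical choice of columns: two root attributes plus the publication number.
FIELDS = ["family-id", "date", "doc-number"]


def extract_row(xml_bytes):
    """Reduce one patent document to a flat dict with the wanted fields."""
    root = ET.fromstring(xml_bytes)  # root element is <patent-document>
    return {
        "family-id": root.get("family-id"),
        "date": root.get("date"),
        # Text of the publication number; None if the path is missing.
        "doc-number": root.findtext(
            "bibliographic-data/publication-reference/document-id/doc-number"
        ),
    }


def write_csv(documents, out):
    """Write one CSV row per XML document to the open text stream `out`."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for xml_bytes in documents:
        writer.writerow(extract_row(xml_bytes))


# Trimmed-down stand-in for one of the patent files.
sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<patent-document family-id="10973265" date="20030724">
  <bibliographic-data>
    <publication-reference>
      <document-id><doc-number>20030137706</doc-number></document-id>
    </publication-reference>
  </bibliographic-data>
</patent-document>"""

buffer = io.StringIO()
write_csv([sample], buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

For real files, the list passed to `write_csv` could be replaced by a generator that reads each file's bytes, so only one document is held in memory at a time. The same row dicts can also be collected in a list and handed to `pandas.DataFrame` if a DataFrame is preferred over a CSV file.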