Parsing an xml file using lxml
I'm trying to edit an xml file by finding each Watts tag and changing the text in it. So far I've managed to change all tags, but not the Watts tag specifically.
My parser is:
from lxml import etree
tree = etree.parse("cycling.xml")
root = tree.getroot()
for watt in root.iter():
    if watt.tag == "Watts":
        watt.text = "strong"
tree.write("output.xml")
This leaves my cycling.xml file unchanged. A snippet from output.xml (which is identical to cycling.xml, since nothing changed) is:
<TrainingCenterDatabase xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2">
<Activities>
<Activity Sport="Biking">
<Id>2018-05-06T20:49:56Z</Id>
<Lap StartTime="2018-05-06T20:49:56Z">
<TotalTimeSeconds>2495.363</TotalTimeSeconds>
<DistanceMeters>15345</DistanceMeters>
<MaximumSpeed>18.4</MaximumSpeed>
<Calories>0</Calories>
<Intensity>Active</Intensity>
<TriggerMethod>Manual</TriggerMethod>
<Track>
<Trackpoint>
<Time>2018-05-06T20:49:56Z</Time>
<Position>
<LatitudeDegrees>49.319297</LatitudeDegrees>
<LongitudeDegrees>-123.024128</LongitudeDegrees>
</Position>
<HeartRateBpm>
<Value>99</Value>
</HeartRateBpm>
<Extensions>
<TPX xmlns="http://www.garmin.com/xmlschemas/ActivityExtension/v2">
<Watts>0</Watts>
<Speed>2</Speed>
</TPX>
</Extensions>
</Trackpoint>
If I change my parser so that it changes every tag except Watts:
for watt in root.iter():
    if watt.tag != "Watts":
        watt.text = "strong"
Then my output.xml file becomes:
<TrainingCenterDatabase xmlns="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2">strong<Activities>strong<Activity Sport="Biking">strong<Id>strong</Id>
<Lap StartTime="2018-05-06T20:49:56Z">strong<TotalTimeSeconds>strong</TotalTimeSeconds>
<DistanceMeters>strong</DistanceMeters>
<MaximumSpeed>strong</MaximumSpeed>
<Calories>strong</Calories>
<Intensity>strong</Intensity>
<TriggerMethod>strong</TriggerMethod>
<Track>strong<Trackpoint>strong<Time>strong</Time>
<Position>strong<LatitudeDegrees>strong</LatitudeDegrees>
<LongitudeDegrees>strong</LongitudeDegrees>
</Position>
<HeartRateBpm>strong<Value>strong</Value>
</HeartRateBpm>
<Extensions>strong<TPX xmlns="http://www.garmin.com/xmlschemas/ActivityExtension/v2">strong<Watts>strong</Watts>
<Speed>strong</Speed>
</TPX>
</Extensions>
</Trackpoint>
<Trackpoint>strong<Time>strong</Time>
<Position>strong<LatitudeDegrees>strong</LatitudeDegrees>
<LongitudeDegrees>strong</LongitudeDegrees>
</Position>
<AltitudeMeters>strong</AltitudeMeters>
<HeartRateBpm>strong<Value>strong</Value>
</HeartRateBpm>
<Extensions>strong<TPX xmlns="http://www.garmin.com/xmlschemas/ActivityExtension/v2">strong<Watts>strong</Watts>
<Speed>strong</Speed>
</TPX>
</Extensions>
</Trackpoint>
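- How can I change just the Watts tag?
- I don't understand what root = tree.getroot() does. I just thought I'd ask this at the same time, though I'm not sure it matters for my particular problem.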
Your document defines a default XML namespace. Look at the xmlns= attribute at the end of the opening tag:
<TrainingCenterDatabase
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2">
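This means there is no element named "Watts" in your document; you need to qualify the tag name with the appropriate namespace. Note that Watts sits inside the TPX element, which declares its own default namespace, so that is the namespace that applies to it. If you print out the value of watt.tag in your loop, you will see: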
$ python filter.py
{http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2}TrainingCenterDatabase
[...]
{http://www.garmin.com/xmlschemas/ActivityExtension/v2}Watts
{http://www.garmin.com/xmlschemas/ActivityExtension/v2}Speed
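If the long {namespace}name strings are hard to read, lxml's etree.QName can split them into a namespace and a local name. The following is only a sketch: it uses a trimmed-down inline document modeled on your snippet (not your real cycling.xml), and the variable names are made up for illustration.

```python
from lxml import etree

# Trimmed-down sample modeled on the snippet in the question (an assumption,
# not the real cycling.xml).
XML = b"""<TrainingCenterDatabase xmlns="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2">
  <Trackpoint>
    <Extensions>
      <TPX xmlns="http://www.garmin.com/xmlschemas/ActivityExtension/v2">
        <Watts>0</Watts>
        <Speed>2</Speed>
      </TPX>
    </Extensions>
  </Trackpoint>
</TrainingCenterDatabase>"""

root = etree.fromstring(XML)

names = []
for el in root.iter():
    print(el.tag)              # the fully qualified "{namespace}name" form
    q = etree.QName(el)        # splits the tag into its two parts
    names.append((q.localname, q.namespace))

for localname, namespace in names:
    print(localname, "is in", namespace)
```

This makes it easy to see that Watts and Speed live in the ActivityExtension namespace while the outer elements live in the TrainingCenterDatabase one.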
With this in mind, you can modify your filter so that it looks like this:
from lxml import etree
tree = etree.parse("cycling.xml")
root = tree.getroot()
for watt in root.iter():
    if watt.tag == "{http://www.garmin.com/xmlschemas/ActivityExtension/v2}Watts":
        watt.text = "strong"
tree.write("output.xml")
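As a variation on the loop above, iter() also accepts a tag name, so the if test can be dropped. This is a minimal sketch run against a small inline document modeled on your snippet rather than the real file; EXT_NS and the other names are just illustrative.

```python
from lxml import etree

# Trimmed-down sample modeled on the snippet in the question (an assumption,
# not the real cycling.xml).
XML = b"""<TrainingCenterDatabase xmlns="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2">
  <Trackpoint>
    <Extensions>
      <TPX xmlns="http://www.garmin.com/xmlschemas/ActivityExtension/v2">
        <Watts>0</Watts>
        <Speed>2</Speed>
      </TPX>
    </Extensions>
  </Trackpoint>
</TrainingCenterDatabase>"""

EXT_NS = "http://www.garmin.com/xmlschemas/ActivityExtension/v2"
WATTS_TAG = "{%s}Watts" % EXT_NS
SPEED_TAG = "{%s}Speed" % EXT_NS

root = etree.fromstring(XML)

# Passing the qualified tag to iter() yields only the matching elements.
for watt in root.iter(WATTS_TAG):
    watt.text = "strong"

watts_texts = [el.text for el in root.iter(WATTS_TAG)]
speed_texts = [el.text for el in root.iter(SPEED_TAG)]
print(watts_texts, speed_texts)
```

Only the Watts elements are touched; Speed and everything else keep their original text.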
You can read more about namespace handling in the lxml documentation.
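One thing that documentation covers is mapping your own prefixes to namespace URIs, which keeps the search expressions short. The sketch below assumes a small inline document modeled on your snippet; the prefixes "tcx" and "ext" are arbitrary names chosen here, not something defined by the file.

```python
from lxml import etree

# Trimmed-down sample modeled on the snippet in the question (an assumption,
# not the real cycling.xml).
XML = b"""<TrainingCenterDatabase xmlns="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2">
  <Trackpoint>
    <Extensions>
      <TPX xmlns="http://www.garmin.com/xmlschemas/ActivityExtension/v2">
        <Watts>0</Watts>
        <Speed>2</Speed>
      </TPX>
    </Extensions>
  </Trackpoint>
</TrainingCenterDatabase>"""

# Prefixes are local to this script; only the URIs have to match the document.
NSMAP = {
    "tcx": "http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2",
    "ext": "http://www.garmin.com/xmlschemas/ActivityExtension/v2",
}

root = etree.fromstring(XML)

# ElementPath search with a prefix map.
for watt in root.findall(".//ext:Watts", namespaces=NSMAP):
    watt.text = "strong"

# The same lookup expressed as a full XPath query.
xpath_texts = [el.text for el in root.xpath("//ext:Watts", namespaces=NSMAP)]
trackpoint_count = len(root.findall("tcx:Trackpoint", namespaces=NSMAP))
print(xpath_texts, trackpoint_count)
```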
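Alternatively, since you used two important words, edit xml, and you are using lxml, consider XSLT (the XML transformation language), where you can define a namespace prefix and change Watts anywhere in the document without looping. Plus, you can pass values into XSLT from Python!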
XSLT (save as .xsl file)
<?xml version="1.0"?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:doc="http://www.garmin.com/xmlschemas/ActivityExtension/v2" version="1.0">
<xsl:output version="1.0" encoding="UTF-8" omit-xml-declaration="no" indent="yes"/>
<xsl:strip-space elements="*"/>
<!-- VALUE TO BE PASSED IN FROM PYTHON -->
<xsl:param name="python_value"/>
<!-- Identity Transform -->
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<!-- ADJUST WATTS TEXT -->
<xsl:template match="doc:Watts">
<xsl:copy><xsl:value-of select="$python_value"/></xsl:copy>
</xsl:template>
</xsl:transform>
Python
from lxml import etree
# LOAD XML AND XSL
doc = etree.parse("cycling.xml")
xsl = etree.parse('XSLT_Script.xsl')
# CONFIGURE TRANSFORMER
transform = etree.XSLT(xsl)
# RUN TRANSFORMATION WITH PARAM
n = etree.XSLT.strparam('Strong')
result = transform(doc, python_value=n)
# PRINT TO CONSOLE
print(result)
# SAVE TO FILE
with open('Output.xml', 'wb') as f:
    f.write(result)
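To try the transformation without any files on disk, the stylesheet and a sample document can be kept as inline strings. This is a sketch under that assumption: the XML is a trimmed-down stand-in for cycling.xml, and the stylesheet is a shortened copy of the one above with the parameter declared as a self-closing element.

```python
from lxml import etree

# Trimmed-down sample modeled on the snippet in the question (an assumption,
# not the real cycling.xml).
XML = b"""<TrainingCenterDatabase xmlns="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2">
  <Trackpoint>
    <Extensions>
      <TPX xmlns="http://www.garmin.com/xmlschemas/ActivityExtension/v2">
        <Watts>0</Watts>
        <Speed>2</Speed>
      </TPX>
    </Extensions>
  </Trackpoint>
</TrainingCenterDatabase>"""

# Shortened copy of the stylesheet: identity transform plus a Watts override.
XSL = b"""<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:doc="http://www.garmin.com/xmlschemas/ActivityExtension/v2" version="1.0">
  <xsl:param name="python_value"/>
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>
  <xsl:template match="doc:Watts">
    <xsl:copy><xsl:value-of select="$python_value"/></xsl:copy>
  </xsl:template>
</xsl:transform>"""

EXT_NS = "http://www.garmin.com/xmlschemas/ActivityExtension/v2"

transform = etree.XSLT(etree.fromstring(XSL))

# strparam() quotes the Python string so XSLT treats it as a string literal.
result = transform(etree.fromstring(XML),
                   python_value=etree.XSLT.strparam("Strong"))

out_root = result.getroot()
watts_texts = [el.text for el in out_root.iter("{%s}Watts" % EXT_NS)]
speed_texts = [el.text for el in out_root.iter("{%s}Speed" % EXT_NS)]
print(watts_texts, speed_texts)
```

Checking the result tree this way confirms that only Watts was rewritten before you write anything to Output.xml.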