Is there a way, using lxml.etree, to skip the first entry or start the iteration at a specific child when parsing an XML file?
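I'm currently using the .iter method from Python's lxml.etree package to parse an XML file. Is there a way, using something like XPath, to skip the first entries or start the iteration at a specific child?
I looked into the itertext and iterparse methods, but based on their definitions I'm not sure they would be any more efficient than narrowing iter down to a specific tag, which I have already done.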
import lxml.etree as et
parsedXML = et.parse(file_path)
for child in parsedXML.iter('{http://www.witsml.org/schemas/131}data'):
    ...  # per-row processing goes here
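The code parses the XML file successfully, but I would like to cut the run time by skipping rows that are empty or lack a sufficient number of characters (all values are comma-separated).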
<logData>
<data>63653079886,,,,,,,,,,,,,,,,,,,,,,,</data>
<data>63653079887,,,,,,,,,,,,,,,,,,,,,,,</data>
<data>63653079888,,,,,,,,,,,,,,,,,,,,,,,</data>
<data>63653079889,,,,,,,,,,,,,,,,,,,,,,,</data>
<data>63653079890,,29.3,155.8,12.25,0.0,0,0,95.31,-86.11,1729654,1202864,1319105,1.00,1.00,-511.4,1.95,74,0,0,264.1,3.4,,356.9</data>
<data>63653079891,,29.3,155.7,12.25,0.0,0,0,95.31,-86.11,1729654,1202864,1319105,1.00,1.00,-511.4,1.95,74,0,0,264.1,3.4,,356.9</data>
<data>63653079892,,29.3,155.8,12.25,0.0,0,0,93.76,-87.65,1729654,1202864,1319105,1.00,1.00,-511.4,1.95,74,0,0,264.1,3.4,,356.9</data>
</logData>
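Apart from the 11-digit value at the start of each row, many rows are empty. I would like to skip those and start the iteration at the first row that has the 12.25 value in this case (row 5 in the example).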
Since a data element containing only the 11-digit value and commas (without any spaces) is 34 characters long, you can test the string length in a predicate:
data[string-length(translate(.,' ','')) > 34]
I used translate() to strip all spaces before checking the string length.
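As a sanity check on the 34-character threshold, the predicate can be mirrored in plain Python. This is only a sketch: the helper name has_values is made up for illustration and is not part of lxml.

```python
# An "empty" row: the 11-digit value followed by 23 commas (24 fields in total).
EMPTY_ROW = "63653079886" + "," * 23


def has_values(text: str) -> bool:
    """Rough Python equivalent of string-length(translate(., ' ', '')) > 34."""
    return len(text.replace(" ", "")) > 34


print(len(EMPTY_ROW))                                           # 34
print(has_values(EMPTY_ROW))                                    # False
print(has_values("63653079889, , , " + "," * 20))               # False: spaces are stripped first
print(has_values("63653079890,,29.3,155.8,12.25" + "," * 19))   # True
```

The threshold only holds while the leading value stays 11 digits long and every row has 24 fields.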
Example...
XML input (input.xml)
<logData>
<data>63653079886,,,,,,,,,,,,,,,,,,,,,,,</data>
<data>63653079887,,,,,,,,,,,,,,,,,,,,,,,</data>
<data>63653079888,,,,,,,,,,,,,,,,,,,,,,,</data>
<data>63653079889,,,,,,,,,,,,,,,,,,,,,,,</data>
<data>63653079889, , , , , , , , , , , , , , , , , , , , , , ,</data>
<data>63653079890,,29.3,155.8,12.25,0.0,0,0,95.31,-86.11,1729654,1202864,1319105,1.00,1.00,-511.4,1.95,74,0,0,264.1,3.4,,356.9</data>
<data>63653079891,,29.3,155.7,12.25,0.0,0,0,95.31,-86.11,1729654,1202864,1319105,1.00,1.00,-511.4,1.95,74,0,0,264.1,3.4,,356.9</data>
<data>63653079892,,29.3,155.8,12.25,0.0,0,0,93.76,-87.65,1729654,1202864,1319105,1.00,1.00,-511.4,1.95,74,0,0,264.1,3.4,,356.9</data>
</logData>
Python (I used XMLParser() to make the printed output nicer. It isn't strictly necessary.)
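from lxml import etree
# remove_blank_text only tidies the printed output; it is optional.
parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse("input.xml", parser=parser)
# Keep only data elements longer than 34 characters once spaces are removed.
for data in tree.xpath("data[string-length(translate(.,' ','')) > 34]"):
    print(etree.tostring(data).decode())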
Output (printed to console)
<data>63653079890,,29.3,155.8,12.25,0.0,0,0,95.31,-86.11,1729654,1202864,1319105,1.00,1.00,-511.4,1.95,74,0,0,264.1,3.4,,356.9</data>
<data>63653079891,,29.3,155.7,12.25,0.0,0,0,95.31,-86.11,1729654,1202864,1319105,1.00,1.00,-511.4,1.95,74,0,0,264.1,3.4,,356.9</data>
<data>63653079892,,29.3,155.8,12.25,0.0,0,0,93.76,-87.65,1729654,1202864,1319105,1.00,1.00,-511.4,1.95,74,0,0,264.1,3.4,,356.9</data>
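Note that input.xml above has no namespace, while the iter call in the question uses the tag {http://www.witsml.org/schemas/131}data. If the real file declares that URI as its default namespace (an assumption, since the question's sample does not show the declaration), the XPath needs a prefix bound through the namespaces argument. A minimal sketch, with the prefix w and the shortened second row chosen purely for illustration:

```python
from lxml import etree

# Assumption: the real document declares the WITSML URI as its default namespace.
NS = {"w": "http://www.witsml.org/schemas/131"}

EMPTY = "63653079889" + "," * 23                               # 34 characters
FULL = "63653079890,,29.3,155.8,12.25,0.0,0,0,95.31,-86.11"    # shortened sample row

xml = (
    '<logData xmlns="http://www.witsml.org/schemas/131">'
    f"<data>{EMPTY}</data><data>{FULL}</data>"
    "</logData>"
)
root = etree.fromstring(xml)

# XPath 1.0 has no notion of a default namespace, so the element name
# must carry a prefix that is mapped in the `namespaces` dictionary.
rows = root.xpath(
    "w:data[string-length(translate(., ' ', '')) > 34]", namespaces=NS
)
for row in rows:
    print(row.text.split(",")[0])
```

Without the prefix, the expression data[...] would match nothing in a namespaced document, because an unprefixed name in XPath 1.0 only matches elements in no namespace.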
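If you really want to test for the 12.25 value when the string lengths of the values before it are unknown, it gets a bit messy in an XPath 1.0 predicate. You can do it with a series of substring-after() calls inside a substring-before(). It isn't pretty, though...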
tree.xpath("data[substring-before(substring-after(substring-after(substring-after(substring-after(translate(.,' ',''),','),','),','),','),',') = '12.25']")
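The predicate above selects every row whose fifth field equals 12.25. If the goal is instead to start iterating at the first such row and keep everything after it, one option is to do the field test in Python with itertools.dropwhile. A minimal sketch, assuming the un-namespaced layout of input.xml; the helper name lacks_value and the shortened rows are made up for illustration:

```python
from itertools import dropwhile

from lxml import etree

# Shortened rows for illustration; the real rows carry 24 comma-separated fields.
xml = """<logData>
<data>63653079888,,,,,</data>
<data>63653079889, , , , ,</data>
<data>63653079890,,29.3,155.8,12.25,0.0</data>
<data>63653079891,,,,,</data>
<data>63653079892,,29.3,155.8,12.25,0.0</data>
</logData>"""

root = etree.fromstring(xml)


def lacks_value(elem, index=4, wanted="12.25"):
    """Return True while the field at `index` (0-based) is not `wanted`."""
    fields = [f.strip() for f in (elem.text or "").split(",")]
    return len(fields) <= index or fields[index] != wanted


# Skip rows until the first one whose fifth field is 12.25, then yield the rest.
for data in dropwhile(lacks_value, root.iter("data")):
    print(data.text.split(",")[0])
```

Unlike the XPath filter, dropwhile stops testing once the first match is found, so a later empty row (63653079891 in this sketch) is still yielded. Neither variant is benchmarked here.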