How to get elements in range in lxml
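I have an XML file similar to the one below. I am trying to get the elements named "elem" whose "id" attribute falls within a given range.
For example: get all "elem" elements from id=4 to id=8.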
<all_levels>
<level1>
<level2>
<level3>
<elem id="1"> </elem>
<elem id="2"> </elem>
</level3>
<level3>
<elem id="3"> </elem>
<elem id="4"> </elem>
</level3>
</level2>
<level2>
<level3>
<elem id="5"> </elem>
<elem id="6"> </elem>
</level3>
<level3>
<elem id="7"> </elem>
<elem id="8"> </elem>
</level3>
</level2>
</level1>
<level1>
<level2>
<level3>
<elem id="9"> </elem>
<elem id="10"> </elem>
</level3>
<level3>
<elem id="11"> </elem>
<elem id="12"> </elem>
</level3>
</level2>
<level2>
<level3>
<elem id="13"> </elem>
<elem id="14"> </elem>
</level3>
<level3>
<elem id="15"> </elem>
<elem id="16"> </elem>
</level3>
</level2>
</level1>
</all_levels>
I have tried two approaches:
1) Use xpath to fetch each required "elem" element individually, e.g. for the range (4, 8):
from lxml import etree
sample_xml = etree.parse("sample_xml.xml")
# One query per id; the elements in the sample are named "elem", not "word".
elem1 = sample_xml.xpath("//elem[@id = '%s']" % str(4))[0]
elem2 = sample_xml.xpath("//elem[@id = '%s']" % str(5))[0]
elem3 = sample_xml.xpath("//elem[@id = '%s']" % str(6))[0]
elem4 = sample_xml.xpath("//elem[@id = '%s']" % str(7))[0]
elem5 = sample_xml.xpath("//elem[@id = '%s']" % str(8))[0]
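But if the range is large, fetching all the elements this way takes too long.
2) Use xpath to get the first element in the range, then use the getnext() method to get its siblings: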
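from lxml import etree
sample_xml = etree.parse("sample_xml.xml")
elem1 = sample_xml.xpath("//elem[@id = '%s']" % str(4))[0]
elems = [elem1]
curr_elem = elem1
current_id = 4
while current_id < 8:
    # getnext() returns the next sibling under the same parent, or None.
    curr_elem = curr_elem.getnext()
    if curr_elem is None:
        # No more siblings here; the remaining ids live under other parents.
        break
    elems.append(curr_elem)
    current_id += 1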
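But the problem is that getnext() can only reach elements under the same parent, so it cannot get the remaining elements.
Is there a better way to get the elements in a range?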
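To make the getnext() limitation concrete, here is a minimal sketch. It uses a trimmed-down inline document as a stand-in for sample_xml.xml (the inline XML and the variable names are my own assumptions, only the nesting matches the sample).

```python
from lxml import etree

# Assumed stand-in for sample_xml.xml: same nesting, fewer elements.
xml = b"""<all_levels><level1><level2>
  <level3><elem id="3"/><elem id="4"/></level3>
</level2><level2>
  <level3><elem id="5"/><elem id="6"/></level3>
</level2></level1></all_levels>"""
root = etree.fromstring(xml)

elem3 = root.xpath("//elem[@id = '3']")[0]
elem4 = root.xpath("//elem[@id = '4']")[0]

# id=3 and id=4 share a <level3> parent, so getnext() finds the sibling.
next_of_3 = elem3.getnext()
# id=4 is the last child of its <level3>, so there is no next sibling.
next_of_4 = elem4.getnext()

print(next_of_3.get("id"))  # 4
print(next_of_4)            # None
```

So the walk stops at the end of each level3 block even though id=5 comes next in document order.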
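It seems we can use xpath to efficiently get all "elem" elements whose "id" attribute falls within a given range.
Below are two methods. I used the cell magic command "%%time" to measure how long each one takes.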
from lxml import etree
sample_xml = etree.parse("sample_xml.xml")
Method 1:
%%time
start_heading_id = 4
ending_heading_id = 1000
elem1 = sample_xml.xpath("//elem[@id = '%s']" % str(start_heading_id))[0]
elems = [elem1]
current_id = start_heading_id
# Stop before the last id so that current_id + 1 never goes past the range.
while current_id < ending_heading_id:
    # Each iteration runs a separate query over the whole document.
    curr_elem = sample_xml.xpath("//elem[@id = '%s']" % str(current_id + 1))[0]
    elems.append(curr_elem)
    current_id += 1
Output (took 13.2 seconds to get all the elements):
CPU times: user 13.2 s, sys: 23.6 ms, total: 13.2 s
Wall time: 13.2 s
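A likely reason for that time is that each "//elem[...]" query searches the whole document again, once per id. If you do need the elements one id at a time, a single pass that builds a lookup table avoids the repeated searches. This is only a sketch: it assumes the ids are unique, and it builds a small flat document in memory instead of reading sample_xml.xml (by_id, start_id and end_id are names I made up).

```python
from lxml import etree

# Hypothetical stand-in document: 16 flat <elem> nodes with ids 1..16.
root = etree.Element("all_levels")
for i in range(1, 17):
    etree.SubElement(root, "elem", id=str(i))

# One traversal in document order builds an id -> element table.
# Assumes every id occurs once; later duplicates would overwrite earlier ones.
by_id = {el.get("id"): el for el in root.iter("elem")}

start_id, end_id = 4, 8
# Dictionary lookups replace one full-document XPath query per id.
elems = [by_id[str(i)] for i in range(start_id, end_id + 1)]

print([el.get("id") for el in elems])  # ['4', '5', '6', '7', '8']
```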
Method 2:
%%time
start_heading_id = 4
ending_heading_id = 1000
elems = sample_xml.xpath("//elem[@id >= '%d' and @id <= '%d']" % (start_heading_id,ending_heading_id))
Output (took 0.0387 seconds to get all the elements):
CPU times: user 39.2 ms, sys: 1.25 ms, total: 40.5 ms
Wall time: 38.7 ms
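As a variation on Method 2, lxml's xpath() also accepts keyword arguments and exposes them as XPath variables ($name), so the bounds do not have to be formatted into the query string. A minimal sketch, again on a small in-memory document rather than sample_xml.xml (the variable names lo and hi are arbitrary):

```python
from lxml import etree

# Hypothetical stand-in document: 16 flat <elem> nodes with ids 1..16.
root = etree.Element("all_levels")
for i in range(1, 17):
    etree.SubElement(root, "elem", id=str(i))

start_id, end_id = 4, 8
# $lo and $hi are bound from the keyword arguments; the >= and <= operators
# compare @id as a number, so "10" sorts after "9" as expected.
elems = root.xpath("//elem[@id >= $lo and @id <= $hi]", lo=start_id, hi=end_id)

print([el.get("id") for el in elems])  # ['4', '5', '6', '7', '8']
```

Note that an id that is not numeric would simply never match, because the comparison converts it to NaN.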