No nested nodes. How to get one piece of information and then get its additional info separately?
For the markup below, I need to get each date together with its own times, hrefs, formats, and so on (not all fields are shown).
<div class="showtimes">
    <h2>The Little Prince</h2>
    <div class="poster" data-poster-url="http://www.test.com">
        <img src="http://www.test.com">
    </div>
    <div class="showstimes">
        <div class="date">9 December, Wednesday</div>
        <span class="show-time techno-3d">
            <a href="http://www.test.com" class="link">12:30</a>
            <span class="show-format">3D</span>
        </span>
        <span class="show-time techno-3d">
            <a href="http://www.test.com" class="link">15:30</a>
            <span class="show-format">3D</span>
        </span>
        <span class="show-time techno-3d">
            <a href="http://www.test.com" class="link">18:30</a>
            <span class="show-format">3D</span>
        </span>
        <div class="date">10 December, Thursday</div>
        <span class="show-time techno-2d">
            <a href="http://www.test.com" class="link">12:30</a>
            <span class="show-format">2D</span>
        </span>
        <span class="show-time techno-3d">
            <a href="http://www.test.com" class="link">15:30</a>
            <span class="show-format">3D</span>
        </span>
    </div>
</div>
For this I use the following code (Python).
for dates in movie.xpath('.//div[@class="showstimes"]/div[@class="date"]'):
    date = dates.xpath('.//text()')[0]
    # Attempts at the inner loop over the show times (none of them worked):
    # for times in dates.xpath('//following-sibling::span[1 = count(preceding-sibling::div[1] | (.//div[@class="date"])[1])]'):
    # for times in dates.xpath('//following-sibling::span[contains(@class,"show-time")]'):
    # for times in dates.xpath('.//../span[contains(@class,"show-time")]'):
    # for times in dates.xpath('//following-sibling::span[preceding-sibling::div[1][.="date"]]'):
        time = times.xpath('.//a/text()')[0]
        url = times.xpath('.//a/@href')[0]
        format_type = times.xpath('.//span[@class="show-format"]/text()')[0]
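Getting the dates is not a problem, but I can't work out how to get the rest of the information for each specific date. I have tried many different approaches with no luck (some of them are left in the comments). I can't find a way to handle the case where the nodes I need come after another node at the same level instead of being nested inside it. In this case: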
-> div Date1
-> span Time1
    -> span href1
    -> span Format1
-> span Time2
    -> span href2
    -> span Format2
-> span Time3
    -> span href3
    -> span Format3
-> div Date2
-> span Time1
    -> span href1
    -> span Format1
# etc etc
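It turns out that lxml supports referencing Python variables from an XPath expression, which proves useful here. For each div date, you can select the following-sibling span elements whose nearest preceding-sibling div date is the current div date. The reference to the current div date is held in the Python variable dates, which is passed to xpath() as the keyword argument current and referenced in the expression as $current: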
for dates in movie.xpath('.//div[@class="showstimes"]/div[@class="date"]'):
    # normalize-space() returns the date text with surrounding whitespace trimmed
    date = dates.xpath('normalize-space()')
    # keep only the spans whose nearest preceding div is the current date div
    for times in dates.xpath('following-sibling::span[preceding-sibling::div[1]=$current]', current=dates):
        time = times.xpath('a/text()')[0]
        url = times.xpath('a/@href')[0]
        format_type = times.xpath('span/text()')[0]
        print(date, time, url, format_type)
Output:
'9 December, Wednesday', '12:30', 'http://www.test.com', '3D'
'9 December, Wednesday', '15:30', 'http://www.test.com', '3D'
'9 December, Wednesday', '18:30', 'http://www.test.com', '3D'
'10 December, Thursday', '12:30', 'http://www.test.com', '2D'
'10 December, Thursday', '15:30', 'http://www.test.com', '3D'
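For anyone who wants to try this without the rest of the scraper, here is a minimal self-contained sketch of the same idea. It assumes lxml is installed; SAMPLE_HTML is a trimmed copy of the markup from the question (the poster block is left out), and the names iter_showtimes and date_div are made up for this illustration, they are not part of lxml.

```python
from lxml import html

# Trimmed copy of the markup from the question (poster block omitted).
SAMPLE_HTML = """
<div class="showtimes">
  <h2>The Little Prince</h2>
  <div class="showstimes">
    <div class="date">9 December, Wednesday</div>
    <span class="show-time techno-3d">
      <a href="http://www.test.com" class="link">12:30</a>
      <span class="show-format">3D</span>
    </span>
    <span class="show-time techno-3d">
      <a href="http://www.test.com" class="link">15:30</a>
      <span class="show-format">3D</span>
    </span>
    <span class="show-time techno-3d">
      <a href="http://www.test.com" class="link">18:30</a>
      <span class="show-format">3D</span>
    </span>
    <div class="date">10 December, Thursday</div>
    <span class="show-time techno-2d">
      <a href="http://www.test.com" class="link">12:30</a>
      <span class="show-format">2D</span>
    </span>
    <span class="show-time techno-3d">
      <a href="http://www.test.com" class="link">15:30</a>
      <span class="show-format">3D</span>
    </span>
  </div>
</div>
"""


def iter_showtimes(movie):
    """Yield (date, time, url, format) tuples, one per show-time span.

    Hypothetical helper: it pairs every date div with the sibling spans
    that follow it, up to the next date div.
    """
    for date_div in movie.xpath('.//div[@class="showstimes"]/div[@class="date"]'):
        date = date_div.xpath('normalize-space()')
        # $current is bound to the current date div via the keyword argument.
        spans = date_div.xpath(
            'following-sibling::span[preceding-sibling::div[1] = $current]',
            current=date_div,
        )
        for span in spans:
            yield (
                date,
                span.xpath('a/text()')[0],     # show time
                span.xpath('a/@href')[0],      # link
                span.xpath('span/text()')[0],  # format (2D / 3D)
            )


movie = html.fromstring(SAMPLE_HTML)
for row in iter_showtimes(movie):
    print(row)
```

One caveat worth knowing: in XPath 1.0 the = operator between two node-sets compares their string values, not node identity. The expression therefore relies on every date div having different text. If two date divs could carry the same text, a predicate based on identity, such as count(preceding-sibling::div[1] | $current) = 1, is one common workaround, since a union never contains the same node twice.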