How to parse an HTML page with lxml when <br /> tags get in the way?
I want to parse the following HTML fragment from the NASA website with lxml in Python:
<p>
<strong>Launch Date:</strong>1981-09-24<br/>
<strong>Launch Vehicle:</strong> Delta<br/>
<strong>Launch Site:</strong> Cape Canaveral, United States<br/>
<strong>Mass:</strong> 550.0 kg<br/>
</p>
I am using the following Python 3 code:
from lxml.html import parse
page = parse("http://nssdc.gsfc.nasa.gov/nmc/spacecraftDisplay.do?id=1981-096A")
rows = page.xpath('//div[@class="urtwo"]/p')[0]
for element in rows:
    print(element.xpath("string()"))
But the values after the labels come out empty:
Launch Date:
Launch Vehicle:
Launch Site:
Mass:
I think it must be related to the </strong> or <br /> tags.
Can anyone help me find a solution?
How about iterating over the strong tags, treating each one as a label and taking the text sibling that follows it as the value:
rows = page.xpath('//div[@class="urtwo"]/p//strong')
for element in rows:
    label = element.text.strip()
    value = element.xpath("following-sibling::text()")[0].strip()
    print(label, value)
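This prints the following (the tuple form with u'' prefixes is Python 2 output; \xa0 is a non-breaking space in the page text):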
('Launch Date:', u'1981-09-24')
(u'Launch\xa0Vehicle:', u'Delta')
(u'Launch\xa0Site:', u'Cape Canaveral, United States')
('Mass:', u'550.0\xa0kg')
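As for why the original loop prints only the labels: lxml keeps the text that follows a closing tag in the preceding element's .tail attribute, and string() evaluated on a <strong> element returns only the text inside it. The sketch below reads .tail directly and also replaces the \xa0 characters seen in the output above. It parses the fragment from a string instead of fetching the NASA URL, and FRAGMENT, clean and parse_fields are names made up for illustration, so treat it as one possible approach rather than the only one.

```python
from lxml import html

# The fragment from the question. \xa0 stands in for the non-breaking
# spaces that appear in the printed output above (an assumption about
# where exactly they occur in the real page).
FRAGMENT = (
    "<p>"
    "<strong>Launch Date:</strong>1981-09-24<br/>"
    "<strong>Launch\xa0Vehicle:</strong>\xa0Delta<br/>"
    "<strong>Launch\xa0Site:</strong>\xa0Cape Canaveral, United States<br/>"
    "<strong>Mass:</strong>\xa0550.0\xa0kg<br/>"
    "</p>"
)


def clean(text):
    """Replace non-breaking spaces and trim surrounding whitespace."""
    return (text or "").replace("\xa0", " ").strip()


def parse_fields(fragment):
    """Map each <strong> label to the text that follows its closing tag."""
    root = html.fromstring(fragment)
    fields = {}
    for strong in root.iter("strong"):
        label = clean(strong.text).rstrip(":")
        # The text after </strong> is stored in .tail, not in .text.
        fields[label] = clean(strong.tail)
    return fields


fields = parse_fields(FRAGMENT)
for label, value in fields.items():
    print(label, "=", value)
```

Using .tail picks up only the text directly after each label, whereas following-sibling::text() returns every later text node and relies on taking the first one.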