Parsing XPath with Python
I'm trying to parse a web page that contains the following:
<table style="width: 100%; border-top: 1px solid black; border-bottom: 1px solid black;">
<tr>
<td colspan="2"
style="border-top: 1px solid black; border-bottom: 1px solid black; background-color: #f0ffd3;">February 20, 2015</td>
</tr>
<tr>
<td style="border-top: 1px solid gray; font-weight: bold;">9:00 PM</td>
<td style="border-top: 1px solid gray; font-weight: bold">14°F</td>
</tr>
<tr>
<td style="border-bottom: 1px solid gray;">Clear<br />
Precip:
0 %<br />
Wind:
from the WSW at 6 mph
</td>
<td style="border-bottom: 1px solid gray;"><img class="wxicon" src="http://i.imwx.com/web/common/wxicons/31/31.gif"
style="border: 0px; padding: 0px 3px" /></td>
</tr>
<tr>
<td style="border-top: 1px solid gray; font-weight: bold;">10:00 PM</td>
<td style="border-top: 1px solid gray; font-weight: bold">13°F</td>
</tr>
<tr>
<td style="border-bottom: 1px solid gray;">Clear<br />
Precip:
0 %<br />
Wind:
from the WSW at 6 mph
</td>
<td style="border-bottom: 1px solid gray;"><img class="wxicon" src="http://i.imwx.com/web/common/wxicons/31/31.gif"
style="border: 0px; padding: 0px 3px" /></td>
</tr>
(It continues with more rows and ends with </table>.)
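from lxml import html

# page holds the HTML source of the web page
tree = html.fromstring(page)
table = tree.xpath('//table/tr')
for item in table:
    for elem in item.xpath('*'):
        # tostring() returns bytes, so compare against a bytes literal
        if b'colspan' in html.tostring(elem):
            print('*', elem.text)
        elif elem.text is not None:
            print(elem.text, end=' ')
        else:
            print()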
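This kind of works. It doesn't get the text after the <br />, and it is far from elegant. How can I get the missing text? Any suggestions for improving the code would also be appreciated.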
How about using .text_content()?
.text_content(): Returns the text content of the element, including the text content of
its children, with no markup.
table = tree.xpath('//table/tr')
for item in table:
    print(' '.join(item.text_content().split()))
join() + split() here collapse each run of whitespace into a single space.
It prints:
February 20, 2015
9:00 PM 14°F
Clear Precip: 0 % Wind: from the WSW at 6 mph
10:00 PM 13°F
Clear Precip: 0 % Wind: from the WSW at 6 mph
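As background on why the original loop lost the text after each <br />: in lxml, an element's .text holds only the text before its first child, and text that follows a child element is stored on that child's .tail. The sketch below uses a made-up, single-cell fragment modelled on the table in the question (the names snippet, td and squash are illustrative) to show the difference and how .text_content() gathers everything.

```python
from lxml import html

# Hypothetical fragment modelled on one weather cell from the question.
snippet = (
    '<table><tr><td>Clear<br />\n'
    'Precip:\n'
    '0 %<br />\n'
    'Wind:\n'
    'from the WSW at 6 mph\n'
    '</td></tr></table>'
)
td = html.fromstring(snippet).xpath('.//td')[0]


def squash(text):
    """Collapse every run of whitespace into a single space."""
    return ' '.join(text.split())


# .text stops at the first child element (the first <br />).
first_part = td.text
# The text after each <br /> lives on that <br /> element's .tail.
tails = [squash(br.tail) for br in td.xpath('br')]
# .text_content() concatenates .text plus all descendant text and tails.
full = squash(td.text_content())

print(first_part)
print(tails)
print(full)
```

Reading .tail by hand works, but .text_content() is usually the shorter route when all of the text in a cell or row is wanted.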
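Since you want to join the time rows with the Precip rows, you can iterate over the tr tags but skip the ones containing Precip in their text. For each time row, get the following tr sibling to pick up its Precip row: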
table = tree.xpath('//table/tr[not(contains(., "Precip"))]')
for item in table:
    text = ' '.join(item.text_content().split())
    if 'AM' in text or 'PM' in text:
        # append the row that immediately follows this time row
        text += ' ' + ' '.join(item.xpath('following-sibling::tr')[0].text_content().split())
    print(text)
Prints:
February 20, 2015
9:00 PM 14°F Clear Precip: 0 % Wind: from the WSW at 6 mph
10:00 PM 13°F Clear Precip: 0 % Wind: from the WSW at 6 mph
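For anyone who wants to try this end to end, here is a self-contained sketch of the same idea. It makes a few assumptions: SAMPLE is an abridged copy of the table from the question (styles and icon cells dropped), the helper names squash and merged_rows are made up for illustration, and the XPath uses following-sibling::tr[1] so that only the nearest following row is fetched.

```python
from lxml import html

# Abridged, hypothetical copy of the question's table: styles and the icon
# cells are dropped, and only the rows shown in the post are included.
SAMPLE = """
<table>
<tr>
<td colspan="2">February 20, 2015</td>
</tr>
<tr>
<td>9:00 PM</td>
<td>14°F</td>
</tr>
<tr>
<td>Clear<br />
Precip:
0 %<br />
Wind:
from the WSW at 6 mph
</td>
</tr>
<tr>
<td>10:00 PM</td>
<td>13°F</td>
</tr>
<tr>
<td>Clear<br />
Precip:
0 %<br />
Wind:
from the WSW at 6 mph
</td>
</tr>
</table>
"""


def squash(text):
    """Collapse every run of whitespace into a single space."""
    return ' '.join(text.split())


def merged_rows(page):
    """Return one string per date row, or per time row joined with its detail row."""
    tree = html.fromstring(page)
    lines = []
    # Skip the detail rows here; each one is picked up from its time row below.
    for row in tree.xpath('//table/tr[not(contains(., "Precip"))]'):
        text = squash(row.text_content())
        if 'AM' in text or 'PM' in text:
            # [1] limits the axis to the nearest following <tr> sibling.
            detail = row.xpath('following-sibling::tr[1]')
            if detail:
                text += ' ' + squash(detail[0].text_content())
        lines.append(text)
    return lines


for line in merged_rows(SAMPLE):
    print(line)
```

Returning a list instead of printing inside the loop keeps the parsing separate from the output, and the `if detail` guard avoids an IndexError should a time row ever be the last row in the table.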