etree generating error when using urllib

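I am trying to use the solutions in this post to parse an HTML table into Python (2.7). When I try either of the first two with a string (as in the example), it works fine. However, when I try to use etree.XML on an HTML page that I read with urllib, I get an error. I checked for each solution, and the variable I pass is also a str. For the following code: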

from lxml import etree
import urllib
yearurl="http://www.boxofficemojo.com/yearly/chart/?yr=2014&p=.htm"
s=urllib.urlopen(yearurl).read()
print type (s)
table = etree.XML(s)

I get this error:

  File "C:/Users/user/PycharmProjects/Wikipedia/TestingFile.py", line 9, in <module>
    table = etree.XML(s)
  File "lxml.etree.pyx", line 2723, in lxml.etree.XML (src/lxml/lxml.etree.c:52448)
  File "parser.pxi", line 1573, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:79932)
  File "parser.pxi", line 1452, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:78774)
  File "parser.pxi", line 960, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:75389)
  File "parser.pxi", line 564, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:71739)
  File "parser.pxi", line 645, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:72614)
  File "parser.pxi", line 585, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:71955)
lxml.etree.XMLSyntaxError: Opening and ending tag mismatch: link line 8 and head, line 8, column 48

And for this code:

from xml.etree import ElementTree as ET
import urllib
yearurl="http://www.boxofficemojo.com/yearly/chart/?yr=2014&p=.htm"
s=urllib.urlopen(yearurl).read()
print type (s)
table = ET.XML(s)

I get this error:

Traceback (most recent call last):
  File "C:/Users/user/PycharmProjects/Wikipedia/TestingFile.py", line 6, in <module>
    table = ET.XML(s)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1300, in XML
    parser.feed(text)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1642, in feed
    self._raiseerror(v)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1506, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: mismatched tag: line 8, column 111

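Although they may look like the same kind of markup, HTML is not as strict as XML, which must be well-formed and follow markup rules (matching opening/closing nodes, escaped entities, etc.). As a result, content that is acceptable as HTML may not be valid XML. Both of your tracebacks point to exactly this: a link tag on line 8 of the page is never closed, which HTML allows but an XML parser rejects.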
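To illustrate the difference without fetching the page, here is a minimal standard-library sketch (Python 3). The `snippet` string is a made-up stand-in for the page's head section, and `parses_as_xml`, `TagCollector`, and `html_tags` are hypothetical helper names, not part of any library. It uses `html.parser` only as a convenient lenient parser for the comparison.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Made-up HTML with an unclosed <link>, similar to what the tracebacks complain about.
snippet = (
    '<html><head><link rel="stylesheet" href="a.css"></head>'
    '<body><p>hi</p></body></html>'
)


def parses_as_xml(text):
    """Return True if the strict XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Lenient HTML parser that records every start tag it sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def html_tags(text):
    """Feed the text to the lenient parser and return the start tags found."""
    collector = TagCollector()
    collector.feed(text)
    collector.close()
    return collector.tags


print(parses_as_xml(snippet))  # the unclosed <link> makes the XML parser fail
print(html_tags(snippet))      # the HTML parser reads the same text without complaint
```

The same text is rejected by the XML parser and read without error by an HTML parser, which is why the fix is to switch parsers rather than to change how the page is downloaded.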

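So consider using etree's HTML() function to parse the page. In addition, you can use XPath to target the specific area you intend to extract or work with. Below is an example that tries to pull the page's main table. Note that the page uses quite a few nested tables.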

from lxml import etree
import urllib.request as rq  # Python 3; on Python 2.7 use "import urllib" and urllib.urlopen
yearurl = "http://www.boxofficemojo.com/yearly/chart/?yr=2014&p=.htm"
s = rq.urlopen(yearurl).read()
print(type(s))

# PARSE PAGE
htmlpage = etree.HTML(s)

# XPATH TO SPECIFIC CONTENT
htmltable = htmlpage.xpath("//table[tr/td/font/a/b='Rank']//text()")

for row in htmltable:
    print(row)
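
If you want rows as Python lists rather than a flat stream of text nodes, something along these lines may work. This is only a sketch: it assumes lxml is installed, it runs against a small made-up `sample` string instead of the live page (whose actual markup is not reproduced here), and `table_rows` is a hypothetical helper name.

```python
from lxml import etree

# Made-up page fragment; note the unclosed <link>, which etree.HTML tolerates.
sample = """
<html><head><link rel="stylesheet" href="style.css"></head>
<body>
<table>
  <tr><td><b>Rank</b></td><td><b>Title</b></td></tr>
  <tr><td>1</td><td>Film A</td></tr>
  <tr><td>2</td><td>Film B &amp; Co</td></tr>
</table>
</body></html>
"""


def table_rows(html_text):
    """Return the table whose header contains 'Rank' as a list of cell-text lists."""
    root = etree.HTML(html_text)
    rows = []
    # Select only the direct rows of the matching table.
    for tr in root.xpath("//table[tr/td/b='Rank']/tr"):
        # string() concatenates all text inside a cell, including nested tags.
        cells = [td.xpath("string()").strip() for td in tr.xpath("td")]
        rows.append(cells)
    return rows


for row in table_rows(sample):
    print(row)
```

On the real page, the XPath predicate would need to match that page's header markup (the answer above uses tr/td/font/a/b='Rank'), and nested tables may require a more specific path.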