Web scraping with Python 3 - Ignoring Duplicate Attribute Errors

I want to build a web scraping application with Python 3. The site I am trying to scrape contains invalid XHTML, because some of its tags have duplicate attribute names.

I want to use xml.dom.minidom to parse the scraped page. Because of the duplicate attribute names, the content fails to parse with the following error:

Traceback (most recent call last):
  File "scraper.py", line 45, in <module>
    scraper.list()
  File "scraper.py", line 34, in list
    dom = parseString(response.text)
  File "C:\Python34\lib\xml\dom\minidom.py", line 1970, in parseString
    return expatbuilder.parseString(string)
  File "C:\Python34\lib\xml\dom\expatbuilder.py", line 925, in parseString
    return builder.parseString(string)
  File "C:\Python34\lib\xml\dom\expatbuilder.py", line 223, in parseString
    parser.Parse(string, True)
xml.parsers.expat.ExpatError: duplicate attribute: line 2, column 43

I want to ignore this error and keep parsing the document. I have no control over the incoming HTML. What can I do?

Here is my code:

import requests
from xml.dom.minidom import parseString


class Scraper:

    def list(self, pages=1):
        response = requests.get('http://example.com')

        # parseString raises ExpatError here when a tag repeats an attribute
        dom = parseString(response.text)

        print(dom.toxml())  # toxml is a method, so it must be called


if __name__ == "__main__":
    scraper = Scraper()
    scraper.list()
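
For reference, the failure does not depend on the scraped site; a minimal document with a duplicated attribute name (illustrative markup, not taken from the real page) triggers the same exception:

from xml.dom.minidom import parseString

# expat rejects the duplicate attribute outright; xml.dom.minidom
# exposes no option to skip or ignore it and keep parsing.
parseString('<div id="a" id="b"></div>')
# raises xml.parsers.expat.ExpatError: duplicate attribute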

There is a better approach: switch to the BeautifulSoup HTML parser. It is quite good at parsing malformed or broken HTML and, depending on the underlying parser library, can be more or less lenient:

from bs4 import BeautifulSoup
import requests

response = requests.get('http://example.com').content
soup = BeautifulSoup(response, "html.parser")  # or use "html5lib", or "lxml"
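
As a rough sketch, the Scraper class from the question could be adapted like this (keeping the placeholder URL from the question, and printing the prettified tree in place of dom.toxml()):

import requests
from bs4 import BeautifulSoup


class Scraper:

    def list(self, pages=1):
        response = requests.get('http://example.com')

        # Duplicate attribute names no longer abort parsing; BeautifulSoup
        # simply keeps a single value per attribute instead of raising.
        soup = BeautifulSoup(response.text, "html.parser")

        print(soup.prettify())


if __name__ == "__main__":
    Scraper().list()

On parser choice: the built-in html.parser needs no extra dependency, lxml is faster but requires installing the lxml package, and html5lib parses pages the way a browser does, making it the most lenient option at the cost of speed.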