Python crawler: downloading HTML page

Question

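I want to (slowly) crawl a website and download every HTML page I crawl. For this I am using the requests library. I have already built my list of URLs to crawl, and I tried to fetch them with urllib.open, but without a user agent I get an error message. So I chose to use requests, but I don't really know how to use it.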

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1'
}
page = requests.get('http://www.xf.com/ranking/get/?Amount=1&From=left&To=right', headers=headers)
with open('pages/test.html', 'w') as outfile:
    outfile.write(page.text)

The problem is that when the script tries to write the response to my file, I get an encoding error:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 6673-6675: ordinal not in range(128)

How can I write to the file without encoding problems?

Answer 1

outfile.write(page.text.encode('utf8', 'replace'))

I found the documentation here: unicode problem
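For context, here is a minimal sketch of what that encode call changes. The sample string is made up for illustration. The first half reproduces the same kind of failure as in the question by encoding non-ASCII text with the ASCII codec, which is what Python 2 does implicitly when a Unicode string is written to a file opened with plain open(). The second half encodes explicitly to UTF-8 instead.

```python
# Illustrative text containing one non-ASCII character (e with acute accent).
text = u'caf\xe9'

# Encoding with the ASCII codec fails, as in the question's traceback.
try:
    text.encode('ascii')
    failure = None
except UnicodeEncodeError as exc:
    failure = str(exc)

# Encoding explicitly to UTF-8 yields bytes that can be written safely.
encoded = text.encode('utf8', 'replace')
```

UTF-8 can represent practically any text, so the 'replace' error handler will seldom have anything to replace here. In Python 3 the resulting bytes need a file opened in binary mode ('wb').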

Answer 2

In Python 2, text files don't accept Unicode strings. Use response.content to access the original binary, undecoded content:

# Open in binary mode: page.content is raw bytes, not text
with open('pages/test.html', 'wb') as outfile:
    outfile.write(page.content)

This will write the downloaded HTML in the original encoding as served by the website.

Alternatively, if you want to re-encode all responses to a specific encoding, use io.open() to produce a file object that does accept Unicode:

import io

with io.open('pages/test.html', 'w', encoding='utf8') as outfile:
    outfile.write(page.text)

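Note that many websites rely on signalling the correct codec in the HTML tags, and the content can be served without a charset parameter in the Content-Type header altogether.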

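In that case requests uses the default codec for the text/* mimetype, Latin-1, to decode the HTML to Unicode text. This is often the wrong codec, and relying on this behaviour can lead to Mojibake output later on. I recommend you stick to writing the binary content and rely on tools like BeautifulSoup to detect the correct encoding later on.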
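To make the Mojibake risk concrete, here is a small sketch that needs no network access. The sample string is an assumption chosen for illustration. It simulates a site that sends UTF-8 bytes, decodes them with the Latin-1 fallback, and then re-encodes the result to UTF-8 as the io.open() approach would.

```python
# Illustrative page text with one non-ASCII character.
original = u'caf\xe9'

# Bytes as a UTF-8 site would send them: two bytes for the accented letter.
raw = original.encode('utf-8')

# Decoding with the Latin-1 fallback turns each byte into its own character.
wrong = raw.decode('latin-1')

# Decoding with the codec the site actually used recovers the text.
right = raw.decode('utf-8')

# Re-encoding the wrongly decoded text bakes the damage into the saved file.
rewritten = wrong.encode('utf-8')
```

Here wrong has five characters instead of four, and rewritten no longer matches raw. Writing the raw bytes instead avoids this, because nothing is decoded at all.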

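Alternatively, test explicitly for the charset parameter being present and only re-encode (via response.text and io.open() or otherwise) if requests did not fall back to the Latin-1 default. See retrieve links from web page using python and BeautifulSoup for an answer where I use such a method to tell BeautifulSoup what codec to use.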
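The explicit charset test could be sketched roughly as follows. The helper name trusted_encoding is hypothetical and not part of requests. It takes the Content-Type header value and the encoding that requests reports, and only passes that encoding through when the header really declared a charset.

```python
def trusted_encoding(content_type, response_encoding):
    """Return response_encoding only if the header declares a charset.

    content_type: value of the Content-Type header, or None if absent.
    response_encoding: the codec the HTTP client picked for the body.
    Returns None when no charset was declared, meaning the codec is
    probably just a default and should not be relied on.
    """
    if 'charset' in (content_type or '').lower():
        return response_encoding
    return None
```

With the page object from the question this would be called as trusted_encoding(page.headers.get('content-type'), page.encoding). If the result is None, write page.content in binary mode. Otherwise it should be safe to re-encode page.text via io.open(), or to hand the codec to BeautifulSoup through its from_encoding argument.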

Tags: html, python, web-crawler, python-requests