Unable to open HTML file with Chinese characters

Question

Hi everyone, I'm having trouble opening an HTML file that contains Chinese characters. Here is the code:

import wget

# problem with Chinese characters
file = wget.download("http://nba.stats.qq.com/player/list.htm#teamId=1")
with open(file, encoding='utf-8') as f:
    html = f.read()
    print(html)

But the output gives me the following error:

    319         # decode input (taking the buffer into account)
    320         data = self.buffer + input
--> 321         (result, consumed) = self._buffer_decode(data, self.errors, final)
    322         # keep undecoded input until the next call
    323         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcc in position 535: invalid continuation byte

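I searched for a while and found some similar questions, but their solutions all seem to use latin-1, which clearly doesn't apply here. Which encoding should I use?

Any suggestions? Thanks~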

Answer 1

The page you are pointing to is not encoded in UTF-8 but in GBK. You can tell by looking at the charset declaration in the page's header:

<meta charset="GBK">

If you specify encoding='gbk', it will work.
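
As a minimal sketch of that fix, the snippet below writes a small GBK-encoded file and reads it back with both codecs. The file name `sample_gbk.html` and the sample text are made up for illustration; with the code from the question you would pass the path returned by `wget.download` instead.

```python
import os
import tempfile

# Hypothetical sample page; the real file would come from wget.download(...)
sample = '<meta charset="GBK"><p>球员列表</p>'
path = os.path.join(tempfile.mkdtemp(), "sample_gbk.html")
with open(path, "wb") as f:
    f.write(sample.encode("gbk"))

# Reading with the wrong codec reproduces the kind of error seen in the question
try:
    with open(path, encoding="utf-8") as f:
        f.read()
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Reading with the codec the page declares works
with open(path, encoding="gbk") as f:
    html = f.read()
```

Here `utf8_failed` ends up true, while the second `open` returns the original text unchanged.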

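On a side note, unless there is no other way, I would avoid wget and use urllib instead, which ships with the Python standard library. It also saves writing the page to disk, and the code is simpler: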

import urllib.request

with urllib.request.urlopen("http://nba.stats.qq.com/player/list.htm") as file:
    html = file.read()
    print(html.decode('gbk'))
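
If you would rather not hard-code the encoding, one option is to read it from the page's own `<meta ... charset=...>` declaration before decoding. The helper below, `sniff_charset`, is a hypothetical name and only a rough sketch: it scans the first couple of kilobytes with a regular expression and falls back to a default of your choosing, so it is not a substitute for a real HTML parser.

```python
import re

def sniff_charset(raw: bytes, default: str = "utf-8") -> str:
    """Return the charset declared in a <meta> tag, or `default` if none is found."""
    # Look only at the start of the document, where <meta> normally lives
    head = raw[:2048]
    match = re.search(rb'<meta[^>]*charset=["\']?([A-Za-z0-9_\-]+)', head, re.IGNORECASE)
    if match:
        return match.group(1).decode("ascii").lower()
    return default

# Made-up sample bytes standing in for the result of file.read()
raw = '<html><head><meta charset="GBK"></head><body>球员</body></html>'.encode("gbk")
encoding = sniff_charset(raw)
html = raw.decode(encoding)
```

With the urllib code above, you would call `sniff_charset` on the bytes returned by `file.read()`. The response's `headers.get_content_charset()` is another place to look, though a server may not send a charset in its Content-Type header at all.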

encoding

utf-8

readfile

web-scraping

python-3.6