Python web scraping: Greek letters not shown

Question

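I'm trying to learn how to automate tasks with python3. Right now I'm trying to open a website, get an element from it, and write its text into a Word document as a new paragraph, using the requests, docx, and bs4 modules. All of this works, but the site contains some Greek letters. When I open the Word document, digits and the like are fine, but the Greek letters display incorrectly (they all appear as Öéëïá and so on). How can I fix this? Here is my code: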

import requests, docx, bs4

doc = docx.Document()
res = requests.get("http://www.betcosmos.com/index.php?page=kouponi_stoixima")
soup = bs4.BeautifulSoup(res.text, "lxml")
elem = soup.select(".kouponi_table")
doc.add_paragraph(elem[0].getText())
doc.save("BetMasterData.docx")

Thanks in advance for your time.

Answer 1

The Requests documentation on response content covers this. From Requests 2.18.4 Documentation - Response Content:

Response Content

We can read the content of the server’s response. Consider the GitHub timeline again:

>>> import requests
>>> r = requests.get('https://api.github.com/events')
>>> r.text
u'[{"repository":{"open_issues":0,"url":"https://github.com/...

Requests will automatically decode content from the server. Most unicode charsets are seamlessly decoded.

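When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers. The text encoding guessed by Requests is used when you access r.text. You can find out what encoding Requests is using, and change it, using the r.encoding property: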

>>> r.encoding
'utf-8'
>>> r.encoding = 'ISO-8859-1'

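If you change the encoding, Requests will use the new value of r.encoding whenever you call r.text. You might want to do this in any situation where you can apply special logic to work out what the encoding of the content will be. For example, HTML and XML have the ability to specify their encoding in their body. In situations like this, you should use r.content to find the encoding, and then set r.encoding. This will let you use r.text with the correct encoding.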
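The "find the encoding in the body" step described above could be sketched roughly as follows. This is only an illustration using the standard library: declared_charset is a hypothetical helper (not part of Requests or bs4), the regular expression handles only simple meta tags, and the sample bytes stand in for r.content with ISO-8859-7 assumed as the page encoding.

```python
import re

# Hypothetical helper, not part of Requests or bs4: a naive search for the
# charset named in an HTML <meta> tag. Real pages may need a proper parser.
_META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE)


def declared_charset(body: bytes, default: str = "utf-8") -> str:
    """Return the charset declared in the first 4 KiB of `body`, else `default`."""
    match = _META_CHARSET.search(body[:4096])
    return match.group(1).decode("ascii").lower() if match else default


# Sample bytes standing in for r.content; the ISO-8859-7 encoding is an assumption.
sample_body = (
    '<html><head><meta http-equiv="Content-Type" '
    'content="text/html; charset=iso-8859-7"></head>'
    "<body>Ελλάδα</body></html>"
).encode("iso-8859-7")

charset = declared_charset(sample_body)  # charset name read from the body itself
text = sample_body.decode(charset)       # decode the raw bytes with that charset
print(charset)
print(text)
```

With a real response, the same idea would be to set r.encoding = declared_charset(r.content) before reading r.text.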

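Requests will also use custom encodings in the event that you need them. If you have created your own encoding and registered it with the codecs module, you can simply use the codec name as the value of r.encoding and Requests will handle the decoding for you.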

Binary Response Content

You can also access the response body as bytes, for non-text requests:

>>> r.content
b'[{"repository":{"open_issues":0,"url":"https://github.com/...

The gzip and deflate transfer-encodings are automatically decoded for you.

Try this:

import requests, docx, bs4

doc = docx.Document()
res = requests.get('http://www.betcosmos.com/index.php?page=kouponi_stoixima')
# Pass the raw bytes (res.content) so BeautifulSoup detects the encoding itself
soup = bs4.BeautifulSoup(res.content, 'lxml')
elem = soup.select('.kouponi_table')
doc.add_paragraph(elem[0].getText())
doc.save('BetMasterData.docx')
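
One plausible explanation for the garbled output, offered as an assumption rather than something confirmed for this site: the page is served in a legacy Greek encoding such as ISO-8859-7 without a charset in its HTTP headers, so Requests falls back to ISO-8859-1 when building res.text. The sketch below reproduces that effect offline; the letters in greek are arbitrary and were picked only because they map onto the Öéëïá from the question.

```python
# Assumption: the server sends ISO-8859-7 bytes; the real site's encoding is not known.
greek = "Φιλοα"                      # arbitrary Greek letters, for illustration only

raw = greek.encode("iso-8859-7")     # the bytes such a server would send
wrong = raw.decode("iso-8859-1")     # decoding them with a Latin-1 fallback
right = raw.decode("iso-8859-7")     # decoding them with the encoding actually used

print(wrong)  # mojibake made of accented Latin letters
print(right)  # the original Greek text
```

If that is what is happening here, setting res.encoding to the correct charset before reading res.text, or handing res.content to BeautifulSoup as above, should both avoid the mismatch.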

web-scraping

python-3.5