Decoding HTML Entities to Unicode

Question

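OK, I've been stuck on this since yesterday. I need to save some text to a ".txt" file, and the problem is that the text I want to save contains HTML entities.

So I imported HTMLParser in my code: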

import HTMLParser
h = HTMLParser.HTMLParser()
print h.unescape(text)  # right?

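This works when you print the result, but I'm trying to return it from one of my functions, which actually saves the text to a file. When I try to save the file, the system shows: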

exceptions.UnicodeEncodeError: 'ascii' codec can't encode character u'\xab' in position 0: ordinal not in range(128)

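I've been reading about this but can't reach any conclusion. I tried BeautifulSoup and I tried functions from well-known Pythonistas, and none of them worked. Can you help me? I need the text saved in the file as Unicode; by Unicode I mean that it will keep characters like á, right?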

Answer 1

"Save Unicode character to a file" is a different question from "Decoding HTML Entities to Unicode". Your code (h.unescape(text)) already decodes the HTML text correctly.

The exception is caused by print unicode_text. For example:

print u"\N{EURO SIGN}"

should produce a similar error.
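The same failure can be reproduced without print at all. The sketch below assumes the output stream falls back to the ASCII codec (as Python 2 does when stdout is redirected), so the explicit encode("ascii") call stands in for what print would do implicitly. The variable names are only illustrative.

```python
# -*- coding: utf-8 -*-
text = u"\xab"  # the character named in the traceback from the question

try:
    # Stand-in for the implicit encoding step: ASCII cannot represent U+00AB.
    text.encode("ascii")
    error = None
except UnicodeEncodeError as exc:
    error = exc

# Encoding explicitly with UTF-8 succeeds and yields two bytes.
encoded = text.encode("utf-8")
```

The decoded text itself is fine; only the choice of output encoding decides whether it can be written.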

If you save to a file by redirecting the output of your Python script, for example:

$ python -m your_module >output.txt #XXX raises an error for non-ascii data

then define the PYTHONIOENCODING=utf-8 environment variable (to save using the UTF-8 encoding):

$ PYTHONIOENCODING=utf-8 python -m your_module >output.txt
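To see the effect of the variable from within Python, one option is to start a child interpreter with PYTHONIOENCODING set and capture its output as bytes. This is only a sketch: the one-line child program is made up for the illustration and stands in for your_module.

```python
import os
import subprocess
import sys

# Copy the current environment and force UTF-8 for the child's stdio.
env = dict(os.environ, PYTHONIOENCODING="utf-8")

# The child prints U+00AB; its stdout is a pipe, as with shell redirection.
out = subprocess.check_output(
    [sys.executable, "-c", "print(u'\\xab')"],
    env=env,
)
```

With the variable set, the captured bytes are the UTF-8 encoding of the character instead of a UnicodeEncodeError.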

If you want to save to a file directly in your Python code, use the io module:

import io

with io.open(filename, 'w', encoding='utf-8') as file:
    file.write(h.unescape(text))
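Putting the two steps together, a minimal self-contained sketch might look like the following. The function name save_unescaped is hypothetical, and the import fallback is an assumption added so the snippet also runs on Python 3, where html.unescape provides the same decoding as HTMLParser().unescape in Python 2.

```python
import io

try:
    from html import unescape  # Python 3.4+
except ImportError:
    import HTMLParser  # Python 2
    unescape = HTMLParser.HTMLParser().unescape


def save_unescaped(text, filename):
    """Decode HTML entities in `text` and write the result as UTF-8."""
    decoded = unescape(text)
    # io.open encodes on write, so no implicit ASCII conversion takes place.
    with io.open(filename, 'w', encoding='utf-8') as f:
        f.write(decoded)
    return decoded
```

For example, save_unescaped(u"&laquo;caf&eacute;&raquo;", "output.txt") would write the decoded text to output.txt as UTF-8. On Python 2 the input should be a unicode object, since a file opened with io.open in text mode only accepts unicode.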
