Browser and HTMLParser entity parsing difference

Question

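I got the following character reference from a web page's HTML: &#151;. On the page it renders as an em dash (EM DASH), in Google Chrome if that matters. No matter which file encoding I try ("utf-8", "cp1251", "cp866"), it is always shown as an em dash on the page. But when I run the following Python code: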

from HTMLParser import HTMLParser

h_parser = HTMLParser()
print h_parser.unescape('&#151;')

it prints a control character that the Unicode table lists as "End of Guarded Area".

What Python code should I use to get an em dash from the '&#151;' string/unicode string? I am using Python 2.7.

Answer 1

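The numeric value in a character reference (151 in your case) refers to Unicode code point 151 (U+0097), which is a control character in the Latin-1 Supplement block.

The invalid value 151 was most likely used because byte 0x97 is the em dash in Windows code page 1252. Browsers probably render it as an em dash to cope with this common mistake.

The correct character reference for an em dash is &#8212;.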

>>> import unicodedata
>>> unicodedata.name(unichr(151))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name

>>> unicodedata.lookup('em dash')
u'\u2014'
>>> unicodedata.lookup('em dash').encode('cp1252')
'\x97'

While Python 2 struggles with this, in Python 3 the html.unescape() function (available since Python 3.4) explicitly handles invalid character references as specified in the HTML 5 spec. If possible, you could use Python 3 to solve your problem:

>>> from html import unescape
>>> unescape('&#151;')
'—'
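
The same remapping applies to the other references in the 128-159 range, not only to 151. A minimal Python 3 sketch, with sample references chosen purely for illustration:

```python
from html import unescape  # Python 3.4+

# Numeric references in the 128-159 range are remapped as if the number
# were a Windows-1252 byte value, which is what browsers do as well.
samples = {
    '&#128;': '\u20ac',  # EURO SIGN
    '&#147;': '\u201c',  # LEFT DOUBLE QUOTATION MARK
    '&#148;': '\u201d',  # RIGHT DOUBLE QUOTATION MARK
    '&#150;': '\u2013',  # EN DASH
    '&#151;': '\u2014',  # EM DASH
}

for reference, expected in samples.items():
    assert unescape(reference) == expected

# The correct reference and the legacy one decode to the same character.
assert unescape('&#8212;') == unescape('&#151;')
```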

If you cannot use Python 3, you could copy the code from the Python 3 html module (see its __init__.py file) and pass the HTML through it before handing it to HTMLParser.
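
A lighter alternative to copying the whole module might be a small preprocessing step that rewrites only the affected references. The sketch below is an illustration under stated assumptions, not code from the standard library. The names fix_cp1252_refs and _fix_ref are hypothetical, only references that end with a semicolon are handled, and values that cp1252 leaves undefined are passed through unchanged. It is written to run on both Python 2.7 and Python 3.

```python
import re

# Matches decimal (&#151;) and hexadecimal (&#x97;) numeric character references.
# Assumption: the reference is terminated by a semicolon.
_NUMERIC_REF = re.compile(r'&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));')


def _fix_ref(match):
    """Rewrite one reference if its value falls in the cp1252 range 0x80-0x9F."""
    hex_digits, dec_digits = match.groups()
    code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
    if 0x80 <= code_point <= 0x9F:
        try:
            # Treat the number as a cp1252 byte and find the intended character.
            char = bytes(bytearray([code_point])).decode('cp1252')
        except UnicodeDecodeError:
            # Byte is undefined in cp1252 (for example 0x81): leave it alone.
            return match.group(0)
        return '&#%d;' % ord(char)
    return match.group(0)


def fix_cp1252_refs(html_text):
    """Return html_text with legacy cp1252-style numeric references corrected."""
    return _NUMERIC_REF.sub(_fix_ref, html_text)
```

In Python 2.7 the result could then be passed to h_parser.unescape() as in the question, for example h_parser.unescape(fix_cp1252_refs('&#151;')).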
