Why doesn't html unescape work here?

Question

This is how I get the data:

import bs4
import requests

page = requests.get('some website')
data = bs4.BeautifulSoup(page.content, "lxml")

I'm using this to unescape:

from xml.sax.saxutils import unescape

# Map each character to its entity, then invert the map for unescaping.
html_escape_table = {'"': "&quot;", "'": "&apos;"}
html_unescape_table = {v: k for k, v in html_escape_table.items()}

def html_unescape(text):
    return unescape(text, html_unescape_table)

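When I call unescape on any part of data (which I believe is a string), it doesn't unescape the way it should. Instead, it just returns the same string I passed to the function (e.g. \u00e8).

However, when I call html_unescape() with a string I actually typed in (e.g. html_unescape('\u00e8')), it works.

Why doesn't it work when I pass in a piece of string from the data I got from BeautifulSoup?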
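
For illustration, here is a minimal sketch of one way the two cases can differ. It assumes (this is not confirmed by the post) that the scraped text contains the six literal characters backslash, u, 0, 0, e, 8, whereas a typed '\u00e8' is a Python escape sequence that already yields the single character è. The names typed, scraped and decoded are made up for the example.

```python
from xml.sax.saxutils import unescape

html_unescape_table = {"&quot;": '"', "&apos;": "'"}

def html_unescape(text):
    return unescape(text, html_unescape_table)

# Typed in source: Python turns the escape into ONE character (è).
typed = '\u00e8'
# Assumed shape of the scraped text: SIX literal characters.
scraped = r'\u00e8'

print(len(typed), len(scraped))

# unescape() only rewrites entities such as &amp; or &quot;.
# Neither string contains one, so both come back unchanged.
print(html_unescape(typed) == typed)
print(html_unescape(scraped) == scraped)

# Entities, by contrast, are replaced.
print(html_unescape('&quot;a &amp; b&quot;'))

# If the text really holds literal backslash escapes (and is ASCII-only,
# an assumption), the unicode_escape codec can decode them.
decoded = scraped.encode('ascii').decode('unicode_escape')
print(decoded == typed)
```

If this assumption holds, the typed call only appears to work: the string was already è before html_unescape() saw it.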

Answer 1

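Standard Python 2 would print <type 'str'> rather than <class 'str'>, so you appear to be receiving a custom str class (note that on Python 3, <class 'str'> is what the built-in str prints). You need to track down where it comes from (requests? BeautifulSoup?) and see which operations it supports.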
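
A rough sketch of how one might check what kind of string object is being passed around. TaggedStr is a hypothetical stand-in for whatever str subclass a library might return; it is not a class from requests or BeautifulSoup. The sketch uses only the standard library, with html.unescape (Python 3.4+) as the unescaping function.

```python
import html

class TaggedStr(str):
    """Hypothetical stand-in for a library-defined str subclass."""

value = TaggedStr("&quot;hi&quot;")

# Inspect the concrete class and its base classes.
print(type(value))
print(type(value).__mro__)
print(isinstance(value, str))

# str() on a subclass instance gives back a plain built-in str.
plain = str(value)
print(type(plain) is str)

# Unescape the plain string with the standard library helper.
result = html.unescape(plain)
print(result)
```

Converting to a plain str first rules out any behavior the subclass might override, which makes it easier to tell whether the type or the content is the problem.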
