HTML parser: convert html ISO-8859-1 encoded text to UTF-8

Question

I am trying to build an HTML email parser in Python 3.7 using Beautiful Soup.

The Content-Type in the email header is: text/html; charset="iso-8859-1"

Here is some of the HTML:

<div dir='3D"ltr"' id='3D"divRplyFwdMsg"'>
         <font color='3D"#000000"' face='=3D"Calibri,' sans-serif"="" style='3D"font-size:11pt"'>
          <b>
           Enviado:
          </b>
          jueves, 9 de mayo de 2019 11:16
          <br/>
          <b>
           Para:
          </b>
          DealReg
          <br/>
          <b>
           Asunto:
          </b>
          Integrated Quoting - Deal Registration ID 001009814954 pa=
ra Cliente client_name Revisi=F3n completa
         </font>
         <div>
         </div>

I need the text to be correctly encoded as UTF-8.

Where the text reads "Integrated Quoting - Deal Registration ID 001009814954 pa= ra Cliente client_name Revisi=F3n completa", I want "Integrated Quoting - Deal Registration ID 001009814954 para Cliente client_name Revisión completa".

I found some solutions, but none of them worked for me:

[1].

with codecs.open(html_path,"r", encoding = "utf-8") as html_file:
           text = html_file.read()

[2].

with io.open(html_path,"r", encoding = "utf-8") as html_file:
           text = html_file.read()

[3].

a = "Revisi=F3n"
b = a.encode("iso-8859-1").decode("utf-8")

>>>print(b)
"Revisi=F3n"

In [3] I also tried encoding with ascii, latin-1, and cp1252; the result is the same.

Thanks!

Answer 1

It looks like the non-ASCII characters have been encoded with quoted-printable encoding (perhaps this HTML is from an email?). The quopri module can be used to decode them to bytes, which can then be decoded to str.

>>> import quopri
>>> s = 'Revisi=F3n'      
>>> quopri.decodestring(s)
b'Revisi\xf3n'   # bytes
>>> quopri.decodestring(s).decode('ISO-8859-1')
'Revisión'

The quopri.decode function will decode an entire file; it reads from and writes to binary file objects.
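As a rough sketch of the file-based approach, the following decodes a quoted-printable stream with quopri.decode, interprets the resulting bytes as ISO-8859-1, and writes them back out as UTF-8. The helper name qp_file_to_utf8 and the sample bytes are made up for illustration, and in-memory io.BytesIO objects stand in for real files opened in binary mode.

```python
import io
import quopri


def qp_file_to_utf8(src, dst, charset="iso-8859-1"):
    """Decode quoted-printable bytes from src and write UTF-8 bytes to dst.

    src and dst are binary file objects; charset is the encoding declared
    in the email's Content-Type header (assumed to be ISO-8859-1 here).
    """
    decoded = io.BytesIO()
    quopri.decode(src, decoded)  # undoes =F3 escapes and "=\n" soft line breaks
    text = decoded.getvalue().decode(charset)
    dst.write(text.encode("utf-8"))
    return text


# Hypothetical sample mimicking the subject line from the question.
raw = b"Deal Registration ID 001009814954 pa=\nra Cliente client_name Revisi=F3n completa\n"
out = io.BytesIO()
text = qp_file_to_utf8(io.BytesIO(raw), out)
print(text)
```

Decoding the quoted-printable layer before handing the markup to Beautiful Soup should also clear up attribute debris such as 3D"ltr", since =3D is just the quoted-printable escape for an equals sign.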
