HTML parser: convert html ISO-8859-1 encoded text to UTF-8
I am trying to build an HTML email parser in Python 3.7 using Beautiful Soup.
The Content-Type in the email header is: text/html; charset="iso-8859-1"
Here is some of the HTML:
<div dir='3D"ltr"' id='3D"divRplyFwdMsg"'>
<font color='3D"#000000"' face='=3D"Calibri,' sans-serif"="" style='3D"font-size:11pt"'>
<b>
Enviado:
</b>
jueves, 9 de mayo de 2019 11:16
<br/>
<b>
Para:
</b>
DealReg
<br/>
<b>
Asunto:
</b>
Integrated Quoting - Deal Registration ID 001009814954 pa=
ra Cliente client_name Revisi=F3n completa
</font>
<div>
</div>
I need the text correctly encoded as UTF-8.
Where it currently says "Integrated Quoting - Deal Registration ID 001009814954 pa=
ra Cliente client_name Revisi=F3n completa" I want "Integrated Quoting - Deal Registration ID 001009814954 para Cliente client_name Revisión completa"
I found some solutions, but none of them worked for me:
[1].
with codecs.open(html_path, "r", encoding="utf-8") as html_file:
    text = html_file.read()
[2].
with io.open(html_path, "r", encoding="utf-8") as html_file:
    text = html_file.read()
[3].
a = "Revisi=F3n"
b = a.encode("iso-8859-1").decode("utf-8")
>>>print(b)
"Revisi=F3n"
In [3] I also tried encoding with ascii, latin-1, and cp1252, and the result is the same.
Thanks!
It looks like the non-ASCII characters have been encoded using quoted-printable encoding (perhaps this HTML is from an email?). The quopri module can be used to decode them to bytes, which can then be decoded to str.
>>> import quopri
>>> s = 'Revisi=F3n'
>>> quopri.decodestring(s)
b'Revisi\xf3n' # bytes
>>> quopri.decodestring(s).decode('ISO-8859-1')
'Revisión'
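The quopri.decode function will decode an entire file; it reads from one binary file object and writes the decoded bytes to another.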
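Applied to the text from the question, the same idea can be sketched as below. This is a minimal sketch, assuming the raw quoted-printable body is available as bytes before it is handed to Beautiful Soup; the helper name decode_qp_html and the shortened sample markup are made up for illustration. Decoding first also restores the attributes, since =3D is just an encoded "=".

```python
import quopri


def decode_qp_html(raw: bytes, charset: str = "iso-8859-1") -> str:
    """Undo quoted-printable encoding, then decode the bytes to str.

    Hypothetical helper: the charset default is taken from the
    Content-Type header quoted in the question.
    """
    # quopri.decodestring removes soft line breaks ("=" at end of line)
    # and turns escapes such as =F3 and =3D back into single bytes.
    decoded_bytes = quopri.decodestring(raw)
    return decoded_bytes.decode(charset)


# Shortened sample based on the question's message body (assumed raw form).
raw = (
    b'<div dir=3D"ltr">Integrated Quoting - Deal Registration ID 001009814954 pa=\n'
    b"ra Cliente client_name Revisi=F3n completa</div>\n"
)

html = decode_qp_html(raw)
print(html)

# A Python str is already Unicode; encode only when UTF-8 bytes are needed,
# for example when writing the result to a file.
utf8_bytes = html.encode("utf-8")
```

The resulting str can then be passed to Beautiful Soup as usual. If the whole message is available, the standard library email package can also undo the transfer encoding of a part, which avoids calling quopri by hand.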