What is the proper method for reading and writing HTML/XML (byte string) with Python and lxml and etree?

Question

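Edit: Now that the problem is solved, I realize it has more to do with properly reading and writing byte strings than with HTML. Hopefully this makes the answer easier for others to find.

I have a poorly formatted HTML file. I want to use a Python library to tidy it up.

It seems like it should be as simple as this: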

import sys
from lxml import etree, html

#read the unformatted HTML
with open('C:/Users/mhurley/Portable_Python/notebooks/View_Custom_Report.html', 'r', encoding='utf-8') as file:
    #read the whole file into a single string
    file_text = ''.join(file.readlines())

#format the HTML
document_root = html.fromstring(file_text)
document = etree.tostring(document_root, pretty_print=True)

#write the nice, pretty, formatted HTML
with open('C:/Users/mhurley/Portable_Python/notebooks/Pretty.html', 'w') as file:
    #write the pretty XML to a file
    file.write(document)

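But this snippet complains that file_lines is not a string or bytes-like object. OK, I suppose it makes sense that the function can't use a list.

But then the result is 'bytes', not a string. No problem: str(document).

But then I get HTML that is full of "\n" sequences that are not newlines... they are a backslash followed by the letter n. There are no actual line breaks in the result, just one long line.

I have tried a number of other odd things, like specifying the encoding, trying to decode, and so on. None of them produced the desired result.

What is the right way to read and write this kind of (is non-ASCII the right term?) text?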

Answer 1

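What you are missing is that etree's tostring method returns bytes, and you need to take that into account when writing that bytestring to a file. Use the b switch in the open function, like this, and forget about the str() conversion: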

with open('Pretty.html', 'wb') as file:
    #write the pretty XML to a file
    file.write(document)
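
To illustrate why str() produced the backslash-n output, here is a minimal sketch using only the standard library. The document value below is made-up sample data standing in for what tostring returns, and the temporary path is only for illustration.

```python
import os
import tempfile

# Stand-in for the bytes that etree.tostring() returns (assumed sample data).
document = b"<html>\n  <body>caf\xc3\xa9</body>\n</html>\n"

# str() on bytes gives the repr, so each newline becomes a backslash plus "n".
as_repr = str(document)

# decode() turns the bytes into real text with real newlines.
as_text = document.decode("utf-8")

path = os.path.join(tempfile.mkdtemp(), "Pretty.html")

# Option 1: write the bytes unchanged to a file opened in binary mode.
with open(path, "wb") as file:
    file.write(document)

# Option 2: write the decoded text to a file opened in text mode.
# newline="" stops Python from translating "\n" on Windows.
with open(path, "w", encoding="utf-8", newline="") as file:
    file.write(as_text)

print(as_repr)
print(as_text)
```

Both options should leave the same bytes on disk; the binary-mode version simply skips the decode and re-encode round trip.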

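Addendum

Although this answer solves the immediate problem and teaches about bytestrings, the answer by Padraic Cunningham is the cleaner and faster way to write lxml etrees to a file.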

Answer 2

This can be done with lxml in a couple of lines of code, without ever calling open. The .write method does exactly what you are trying to do:

from lxml import html

# parse using the file name, which is also the recommended way
tree = html.parse("C:/Users/mhurley/Portable_Python/notebooks/View_Custom_Report.html")
# call write on the tree
tree.write("C:/Users/mhurley/Portable_Python/notebooks/Pretty.html", pretty_print=True, encoding="utf-8")
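
For anyone who wants to try this without the original report file, here is a self-contained sketch of the same parse and write round trip. The sample HTML and the temporary file names are hypothetical; only html.parse and tree.write come from the answer above.

```python
import os
import tempfile

from lxml import html

# Hypothetical sample input: small, untidy HTML with unclosed <p> tags.
workdir = tempfile.mkdtemp()
source_path = os.path.join(workdir, "untidy.html")
pretty_path = os.path.join(workdir, "pretty.html")
with open(source_path, "wb") as file:
    file.write(b"<html><body><p>Hello<p>World</body></html>")

# Parse from the file name; lxml reads the bytes itself.
tree = html.parse(source_path)

# write() serializes to bytes and handles the output file itself.
tree.write(pretty_path, pretty_print=True, encoding="utf-8")

# Read the result back as bytes, then decode it only for display.
with open(pretty_path, "rb") as file:
    result = file.read()
print(result.decode("utf-8"))
```

Because lxml does both the reading and the writing here, the bytes versus str question never comes up in your own code.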

Also, file_text = ''.join(file.readlines()) is exactly the same as file_text = file.read().
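
A quick way to check that equivalence, sketched with an in-memory io.StringIO object in place of a real file (the sample text is made up):

```python
import io

sample = "first line\nsecond line\nlast line without newline"

# readlines() keeps each line's trailing newline, so joining restores the text.
joined = "".join(io.StringIO(sample).readlines())

# read() returns the same text in a single call.
whole = io.StringIO(sample).read()

print(joined == whole)
```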


html

lxml

character-encoding

elementtree

python-3.x