Python ElementTree generates a not well-formed XML file with special character '\x0b'

I generated XML containing the special character '\x0b' with ElementTree and then parsed it with minidom. The parser throws a "not well-formed" error.

import xml.etree.ElementTree as ET
from xml.dom import minidom
root = ET.Element('root')
root.text='\x0b'
xml = ET.tostring(root, 'UTF-8')
print(xml)
pretty_tree = minidom.parseString(xml)

Generated XML: <root>\x0b</root>

Error:

Traceback (most recent call last):
  File "testXml.py", line 7, in <module>
    pretty_tree = minidom.parseString(xml)
  File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/xml/dom/minidom.py", line 1968, in parseString
    return expatbuilder.parseString(string)
  File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/xml/dom/expatbuilder.py", line 925, in parseString
    return builder.parseString(string)
  File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/xml/dom/expatbuilder.py", line 223, in parseString
    parser.Parse(string, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 6

\x0b is a restricted character in XML. The answers to this question give a good description of valid and restricted characters.
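
For reference, the Char production in the XML 1.0 specification allows only #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD and #x10000-#x10FFFF. A minimal sketch of a check against those ranges follows; the function names is_xml10_char and find_invalid_chars are made up for illustration and are not part of any library:

```python
def is_xml10_char(ch):
    """Return True if ch is allowed by the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def find_invalid_chars(text):
    """Return (index, character) pairs for characters XML 1.0 cannot carry."""
    return [(i, ch) for i, ch in enumerate(text) if not is_xml10_char(ch)]


print(find_invalid_chars('a\x0bb'))     # [(1, '\x0b')]
print(find_invalid_chars('tab\there'))  # []
```

Running something like this over text before assigning it to an element shows which characters a parser would later reject.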

This behaviour has been raised as a bug in the past and was resolved as "won't fix".

The author of the ElementTree module commented:

For ET, [this behaviour is] very much on purpose. Validating data provided by every single application would kill performance for all of them, even if only a small minority would ever try to serialize data that cannot be represented in XML.

The closing comment (by the maintainer of lxml, who is also a Python core developer) includes these observations:

This is a tricky decision. lxml, for example, validates user input, but that's because it has to process it anyway and does it along the way directly on input (and very efficiently in C code). ET, on the other hand, is rather lenient about what it allows users to do and doesn't apply much processing to user input. It even allows invalid trees during processing and only expects the tree to be serialisable when requested to serialise it.

I think that's a fair behaviour, because most user input will be ok and shouldn't need to suffer the performance penalty of validating all input. Null-characters are a very rare thing to find in text, for example, and I think it's reasonable to let users handle the few cases by themselves where they can occur.

...

In the end, users who really care about correct output should run some kind of schema validation over it after serialisation, as that would detect not only data issues but also structural and logical issues (such as a missing or empty attribute), specifically for their target data format. In some cases, it might even detect random data corruption due to old non-ECC RAM in the server machine. :)

...

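So in summary, ET.tostring producing XML that is not well-formed is by design. If necessary, the output can be parsed with ET.fromstring or another parser to check that it is well-formed. Alternatively, lxml can be used instead of ElementTree.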
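
A minimal sketch of that well-formedness check, assuming it is acceptable to parse the serialized bytes a second time. The helper name is_well_formed is hypothetical; ET.fromstring and ET.ParseError are the standard library calls it relies on:

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_bytes):
    """Return True if xml_bytes parses cleanly, False on a parse error."""
    try:
        ET.fromstring(xml_bytes)
    except ET.ParseError:
        return False
    return True


root = ET.Element('root')
root.text = '\x0b'
print(is_well_formed(ET.tostring(root, 'UTF-8')))  # False

root.text = 'plain text'
print(is_well_formed(ET.tostring(root, 'UTF-8')))  # True
```

This only reports that something is wrong. It does not repair the data, so the offending characters still have to be removed or replaced before serializing again.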

As a workaround of my own, I wrote a helper method that strips restricted characters before the text is saved to the XML model:

import re
def clean(text):
  # Keep only XML 1.0 characters; \U (8 hex digits) is required above U+FFFF.
  return re.sub(r'[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]+', '', text)
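
Invalid characters can also end up in tail text and attribute values, so it may be convenient to clean a whole tree just before serializing. The following is only a sketch of that idea. The names strip_invalid_xml_chars and sanitize_tree are hypothetical, and the pattern assumes the XML 1.0 character ranges:

```python
import re
import xml.etree.ElementTree as ET
from xml.dom import minidom

# Matches runs of characters outside the XML 1.0 Char production.
_INVALID_XML_CHARS = re.compile(
    r'[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]+'
)


def strip_invalid_xml_chars(text):
    """Remove characters that XML 1.0 cannot represent."""
    return _INVALID_XML_CHARS.sub('', text)


def sanitize_tree(element):
    """Clean text, tail and attribute values of element and its descendants in place."""
    for node in element.iter():
        if node.text:
            node.text = strip_invalid_xml_chars(node.text)
        if node.tail:
            node.tail = strip_invalid_xml_chars(node.tail)
        # Copy the items first so the mapping is not mutated while iterating.
        for key, value in list(node.attrib.items()):
            node.attrib[key] = strip_invalid_xml_chars(value)


root = ET.Element('root', {'note': 'a\x0bb'})
root.text = 'before\x0bafter'
sanitize_tree(root)

# minidom now accepts the serialized output.
document = minidom.parseString(ET.tostring(root, 'UTF-8'))
print(document.documentElement.toxml())  # <root note="ab">beforeafter</root>
```

Dropping the characters silently is a design choice. Replacing them with a visible placeholder such as '?' would make the data loss easier to spot.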