如何改造&#xxx;字符到他们的正常表示？

Question

我得到了未处理的 &#xxx; 个字符，我想将其转换回原始字符。

让我们执行一个不执行任何操作的简单 XSL 转换（输出 = 输入）使用俄语字符:

input.xml 是：

<root>Здраве</root>

transform.xsl 是：

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="xml" indent="yes"/>
  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>

这是我的 python 代码：

import lxml.etree as ET

dom = ET.parse("input.xml")
xslt = ET.parse("transform.xsl")
transform = ET.XSLT(xslt)
newdom = transform(dom)

print(ET.tostring(newdom, pretty_print=True))

输出为：

b'<root>&#1047;&#1076;&#1088;&#1072;&#1074;&#1077;</root>\n'

演示：https://repl.it/join/lktibwya-vincentandrieu

我的问题是我需要将其保存到 Здраве 而不是 Здраве

的文件中

如何将特殊字符转换为它们的正常表示形式？

Answer 1

您可以使用 html 模块：

html.unescape('<root>&#1047;&#1076;&#1088;&#1072;&#1074;&#1077;</root>\n')

'<root>Здраве</root>\n'

如果您要接收字节，则需要先将它们转换为字符串：

b = b'<root>&#1047;&#1076;&#1088;&#1072;&#1074;&#1077;</root>\n'

html.unescape(b.decode('utf-8'))

'<root>Здраве</root>\n'

您也可以尝试在 ET.tostring 的调用中使用 encoding='unicode'。它应该直接 return 一个 Python 字符串，因为 Python 在内部对字符串使用 unicode:

 print(ET.tostring(newdom, encoding='unicode', pretty_print=True))

Answer 2

这里的 р 1088 基本上是 Unicode 代码点。在 python 中，您可以通过 chr(integer value of Unicode code point).

将 Unicode 代码点转换为实际表示

另外前导b'<root...表示是binary。所以我们需要用.decode()转换成string.

最后，我们可以使用 regular expression 来获取 Unicode 代码点： &#(\d{4});
&# ：将匹配以 &#
开头 ( ) : 捕获一组
\d{4} : 选择长度为 4
的数字 ; : 以 ;

结尾

import re

a = b'<root>&#1047;&#1076;&#1088;&#1072;&#1074;&#1077;</root>\n'
''.join([chr(int(i)) for i in re.findall(r'&#(\d{4});', a.decode())])

Здраве

如何改造&#xxx;字符到他们的正常表示？

How to transform &#xxx; characters to their normal representation?

python

xslt

python-2.7

python-3.x