How to make lxml write its output file with UTF-8 encoding

data.xml

<?xml version="1.0" encoding="UTF-8"?>
<ArticleSet>
    <Article>            
        <LastName>Bojarski</LastName>
        <ForeName>-</ForeName>
        <Affiliation>-</Affiliation>            
    </Article>
    <Article>            
        <LastName>Genç</LastName>
        <ForeName>Yasemin</ForeName>
        <Affiliation>fgjfgnfgn</Affiliation>            
    </Article>
</ArticleSet>

Sample code

from lxml import etree

dom = etree.parse('data.xml')
root = dom.getroot()

for article in dom.xpath('Article[Affiliation="-"]'):
    root.remove(article)

dom.write('output.xml')

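This code removes every article whose affiliation is "-", that is, whose Affiliation tag looks like <Affiliation>-</Affiliation>. When I store the remaining output in output.xml, the non-ASCII character in Genç is written as the character reference Gen&#231;. I want to store it as is.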

Output of the code

<ArticleSet>
    <Article>            
        <LastName>Gen&#231;</LastName>
        <ForeName>Yasemin</ForeName>
        <Affiliation>fgjfgnfgn</Affiliation>            
    </Article>
</ArticleSet>

Desired output

<ArticleSet>
    <Article>            
        <LastName>Genç</LastName>
        <ForeName>Yasemin</ForeName>
        <Affiliation>fgjfgnfgn</Affiliation>            
    </Article>
</ArticleSet>

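The write method of the tree object returned by etree.parse has an encoding parameter. You can also pass xml_declaration=True to declare the encoding in the output document.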

dom.write('output.xml', encoding='utf-8', xml_declaration=True)
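To see the difference without touching the file system, here is a minimal sketch that writes the same tree twice into in-memory buffers, once with the default settings and once with encoding='utf-8'. The trimmed-down XML string and the names xml_source, default_bytes, and utf8_bytes are made up for this illustration; the behaviour shown assumes lxml's default serialization encoding is ASCII, which matches the output in the question.

```python
import io
from lxml import etree

# Illustrative input: a shortened version of data.xml, encoded as UTF-8 bytes.
xml_source = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<ArticleSet>\n'
    '    <Article><LastName>Bojarski</LastName><Affiliation>-</Affiliation></Article>\n'
    '    <Article><LastName>Genç</LastName><Affiliation>fgjfgnfgn</Affiliation></Article>\n'
    '</ArticleSet>\n'
).encode('utf-8')

dom = etree.parse(io.BytesIO(xml_source))
root = dom.getroot()

# Same filtering step as in the question.
for article in dom.xpath('Article[Affiliation="-"]'):
    root.remove(article)

# Default write: non-ASCII characters come out as character references.
default_buf = io.BytesIO()
dom.write(default_buf)
default_bytes = default_buf.getvalue()

# Explicit UTF-8: the character is written as is, with an XML declaration.
utf8_buf = io.BytesIO()
dom.write(utf8_buf, encoding='utf-8', xml_declaration=True)
utf8_bytes = utf8_buf.getvalue()

print(default_bytes.decode('ascii'))
print(utf8_bytes.decode('utf-8'))
```

Both outputs are valid XML that parse back to the same text; the encoding argument only changes how the character is represented in the file.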

See the lxml documentation.