Word wrapping in xml node content with lxml
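I'm using lxml to write an XML file, and in one of the nodes the content to write is a very long string.
I'm looking for a way to wrap these strings inside the XML node.
So far, I have tried the following: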
from lxml import etree
def lines_lenght(string, width):
    # Note: `width` is the number of words per line, not a character count
    words = string.split()
    for i in range(0, len(words), width):
        yield " ".join(words[i:i+width])
s = """
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec in enim at arcu tincidunt tristique. Ut commodo dui hendrerit lobortis egestas. Orci varius natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Sed laoreet interdum enim ut cursus. Fusce condimentum dictum dictum. Morbi feugiat bibendum enim, ut mollis turpis tincidunt vitae. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Fusce libero ante, consectetur at sollicitudin at, eleifend lacinia ipsum. In hac habitasse platea dictumst. Sed laoreet mi eu nisi condimentum, sit amet vestibulum purus elementum. Nam a eros mi.
"""
root = etree.Element("corpus")
doc = etree.ElementTree(root)
article_node = etree.SubElement(root, "article")
final_content = "\n".join(lines_lenght(s, 10))
article_node.text = final_content
doc.write("corpus.xml", xml_declaration=True, encoding="utf-8")
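As a side note, the helper above groups the text by a fixed number of words, so line lengths vary with word length. If the goal is a maximum line width in characters, the standard library's textwrap module may be a better fit. The following is only a minimal sketch: wrap_text and the 40-column width are illustrative names and values, not part of the original code.

```python
import textwrap

def wrap_text(text, width=70):
    """Normalize whitespace in `text`, then wrap it to at most `width` columns."""
    # Collapse runs of spaces/newlines into single spaces before wrapping
    normalized = " ".join(text.split())
    # textwrap.fill returns one string with the wrapped lines joined by "\n"
    return textwrap.fill(normalized, width=width)

sample = (
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
    "Donec in enim at arcu tincidunt tristique. "
    "Ut commodo dui hendrerit lobortis egestas."
)
wrapped = wrap_text(sample, width=40)
print(wrapped)
```

The returned string can be assigned to article_node.text in the same way as final_content.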
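But in the generated XML file, the line breaks do not seem to be preserved. Following a suggestion I found, I also tried an XML newline character reference in place of \n, but the result is the same.
Any hints that could help?
EDIT: here is a preview of what I'm trying to achieve: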
<corpus>
<article>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec in
enim at arcu tincidunt tristique. Ut commodo dui hendrerit lobortis
egestas. Orci varius natoque penatibus et magnis dis parturient montes</article>
</corpus>
Instead of:
<corpus>
<article>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec in enim at arcu tincidunt tristique. Ut commodo dui hendrerit lobortis egestas. Orci varius natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus.</article>
</corpus>
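Well, it took a while to get there. Along the way I had to lean on another answer for help, and to move out of lxml (which, as others have said, is a great library but has a lot of limitations) and into the Python built-ins.
I started the same way you did, but stopped right after article_node.text = final_content (and before doc.write()). Then I added this helper from that answer: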
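def indent(elem, level=0):
    # Whitespace placed before/after tags at this nesting level
    i = "\n" + level*" "
    if len(elem):
        # Element has children: only fill in text/tail that is empty or blank,
        # so real text content (including its line breaks) is left untouched
        if not elem.text or not elem.text.strip():
            elem.text = i + " "
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
        for child in elem:
            indent(child, level+1)
        # The last child's tail dedents back to the parent's level
        if not child.tail or not child.tail.strip():
            child.tail = i
    else:
        if level and (not elem.tail or not elem.tail.strip()):
            elem.tail = i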
And then:
import xml.etree.ElementTree as ET
root2 = ET.fromstring(etree.tostring(doc))
tree = ET.ElementTree(root2)
indent(root2)
tree.write("corpus.xml", encoding="utf-8", xml_declaration=True)
To test it:
with open("corpus.xml") as f:
    print(f.read())
Output:
<?xml version='1.0' encoding='utf-8'?>
<corpus>
<article>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec in
enim at arcu tincidunt tristique. Ut commodo dui hendrerit lobortis
egestas. Orci varius natoque penatibus et magnis dis parturient montes,
nascetur ridiculus mus. Sed laoreet interdum enim ut cursus. Fusce
condimentum dictum dictum. Morbi feugiat bibendum enim, ut mollis turpis
tincidunt vitae. Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Fusce libero ante, consectetur at sollicitudin at, eleifend lacinia ipsum.
In hac habitasse platea dictumst. Sed laoreet mi eu nisi
condimentum, sit amet vestibulum purus elementum. Nam a eros mi.</article>
</corpus>
Those more familiar with the xml libraries can probably shorten this, but it's the best I could do...
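One possible shortening, offered only as a sketch: since Python 3.9 the built-in xml.etree.ElementTree module provides ET.indent(), which can stand in for the hand-written indent() helper. The example below assumes Python 3.9+ and swaps in textwrap.fill for the word-count helper from the question; the sample text and the 40-column width are illustrative.

```python
import textwrap
import xml.etree.ElementTree as ET

text = ("Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
        "Donec in enim at arcu tincidunt tristique.")

root = ET.Element("corpus")
article = ET.SubElement(root, "article")
# Wrapped text contains literal "\n" characters between lines
article.text = textwrap.fill(text, width=40)

# Pretty-print the tree in place (Python 3.9+); the text of leaf
# elements such as <article> is not modified
ET.indent(root, space="  ")

xml_bytes = ET.tostring(root, encoding="utf-8", xml_declaration=True)
print(xml_bytes.decode("utf-8"))

# Round trip: parse the serialized document and read the text back
parsed = ET.fromstring(xml_bytes)
```

To write a file instead of a byte string, ET.ElementTree(root).write("corpus.xml", encoding="utf-8", xml_declaration=True) should behave the same way as the tree.write() call above.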