How can I select and update text nodes in mixed content using lxml?
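I need to check every word in every text() node of an XML file. I'm using the XPath //text() to select the text nodes and a regular expression to select the words. If a word is in a set of keywords, I need to replace it with something and update the XML.
Setting the text of an element is usually done through .text, but .text on an _Element only changes the first child text node. In a mixed-content element, the other text nodes are actually the .tail of their preceding sibling element.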
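To make the .text/.tail split concrete, here is a minimal sketch. The sample fragment and the variable names are made up for illustration and are not part of the XML in the question:

```python
from lxml import etree

# Hypothetical mixed-content fragment, used only for illustration.
para = etree.fromstring("<para>one <b>two</b> three <i>four</i> five</para>")
b, i = list(para)  # the two child elements, in document order

print(repr(para.text))  # text before the first child element
print(repr(b.text))     # text inside <b>
print(repr(b.tail))     # text following </b>, stored on <b>
print(repr(i.tail))     # text following </i>, stored on <i>

# Assigning .text replaces only the leading text node;
# the tails of <b> and <i> are left untouched.
para.text = "ONE "
result = etree.tostring(para, encoding="unicode")
print(result)
```

Only the text in front of the first child changes, which is why a loop that writes to .text alone misses most of the text in a mixed-content element.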
How do I update all of the text nodes?
In the simplified example below, I'm just trying to wrap matching keywords in square brackets...
Input XML
<doc>
<para>I think the only card she has <gotcha>is the</gotcha> Lorem card. We have so many things that we have to do
better... and certainly ipsum is one of them. When other <gotcha>websites</gotcha> give you text, they're not
sending the best. They're not sending you, they're <gotcha>sending words</gotcha> that have lots of problems
and they're <gotcha>bringing</gotcha> those problems with us. They're bringing mistakes. They're bringing
misspellings. They're typists… And some, <gotcha>I assume</gotcha>, are good words.</para>
</doc>
Desired output
<doc>
<para>I think [the] only card she has <gotcha>[is] [the]</gotcha> Lorem card. We have so many things that we have to do
better... and certainly [ipsum] [is] one of them. When other <gotcha>websites</gotcha> give you text, they're not
sending [the] [best]. They're not sending you, they're <gotcha>sending words</gotcha> that have lots of [problems]
and they're <gotcha>bringing</gotcha> those [problems] with us. They're bringing [mistakes]. They're bringing
misspellings. They're typists… And some, <gotcha>I assume</gotcha>, are good words.</para>
</doc>
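I found the key to this solution in the documentation: Using XPath to find text
Specifically, the is_text and is_tail properties of _ElementUnicodeResult.
Using these properties, I can tell whether I need to update the .text or the .tail property of the parent _Element.
This is a little tricky to understand at first, because when you call getparent() on a text node (_ElementUnicodeResult) that is the tail of its preceding sibling (.is_tail == True), the preceding sibling is what is returned as the parent, not the actual parent.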
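The getparent() behaviour described above can be seen in a small sketch. The fragment and the rows list are illustrative names only, assumed for this demonstration:

```python
from lxml import etree

# Hypothetical fragment, used only for illustration.
para = etree.fromstring("<para>one <b>two</b> three</para>")

rows = []
for text in para.xpath("//text()"):
    # getparent() returns the element that stores the string:
    # the containing element for .text, the preceding sibling for .tail.
    owner = text.getparent()
    rows.append((str(text), owner.tag, bool(text.is_text), bool(text.is_tail)))

for row in rows:
    print(row)

# For a tail node, the element that actually contains the text
# is one level further up.
tail_node = para.xpath("//text()")[2]
container_tag = tail_node.getparent().getparent().tag
print(container_tag)
```

The last text node, " three", reports b as its parent even though it sits inside para, so the update has to be written to b.tail.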
Example...
Python
import re
from lxml import etree
xml = """<doc>
<para>I think the only card she has <gotcha>is the</gotcha> Lorem card. We have so many things that we have to do
better... and certainly ipsum is one of them. When other <gotcha>websites</gotcha> give you text, they're not
sending the best. They're not sending you, they're <gotcha>sending words</gotcha> that have lots of problems
and they're <gotcha>bringing</gotcha> those problems with us. They're bringing mistakes. They're bringing
misspellings. They're typists… And some, <gotcha>I assume</gotcha>, are good words.</para>
</doc>
"""
def update_text(match, word_list):
    # Wrap the word in square brackets if it is a keyword
    if match in word_list:
        return f"[{match}]"
    else:
        return match
root = etree.fromstring(xml)
keywords = {"ipsum", "is", "the", "best", "problems", "mistakes"}
for text in root.xpath("//text()"):
    # For a tail node, getparent() returns the preceding sibling element
    parent = text.getparent()
    updated_text = re.sub(r"[\w]+", lambda match: update_text(match.group(), keywords), text)
    if text.is_text:
        parent.text = updated_text
    elif text.is_tail:
        parent.tail = updated_text
etree.dump(root)
Output (dumped to console)
<doc>
<para>I think [the] only card she has <gotcha>[is] [the]</gotcha> Lorem card. We have so many things that we have to do
better... and certainly [ipsum] [is] one of them. When other <gotcha>websites</gotcha> give you text, they're not
sending [the] [best]. They're not sending you, they're <gotcha>sending words</gotcha> that have lots of [problems]
and they're <gotcha>bringing</gotcha> those [problems] with us. They're bringing [mistakes]. They're bringing
misspellings. They're typists… And some, <gotcha>I assume</gotcha>, are good words.</para>
</doc>
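As a possible alternative to the XPath approach above, the same update can be sketched by walking the elements and rewriting .text and .tail directly. This is a different technique, not the one used in the answer; bracket_keywords and update_all_text are hypothetical helper names, and the short sample string is made up. Because only elements are visited, text that follows a comment or processing instruction would be skipped, which the //text() version does handle.

```python
import re
from lxml import etree

WORD = re.compile(r"\w+")

def bracket_keywords(s, keywords):
    """Wrap each word of s that is in keywords in square brackets."""
    if s is None:  # .text and .tail are None when there is no text
        return None
    return WORD.sub(
        lambda m: f"[{m.group()}]" if m.group() in keywords else m.group(), s
    )

def update_all_text(root, keywords):
    # tag=etree.Element restricts the walk to elements
    # (comments and processing instructions are not visited).
    for el in root.iter(tag=etree.Element):
        el.text = bracket_keywords(el.text, keywords)
        el.tail = bracket_keywords(el.tail, keywords)

root = etree.fromstring("<p>the card <b>is the</b> best</p>")
update_all_text(root, {"is", "the", "best"})
result = etree.tostring(root, encoding="unicode")
print(result)
```

This avoids the is_text/is_tail check because each element's own .text and .tail are addressed explicitly.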