python - lxml - 如何优雅地标记文本中的单词?

python - lxml - How to tag words in texts gracefully?

我需要用 lxml 标记某些单词。以此为例,

<span>Please BOLD me, <br /> BOLD me too</span>

我需要找出所有特定的单词,'BOLD' 这里 ,并为它们添加标签。结果应该是:

<span>Please <b>BOLD</b> me, <br /> <b>BOLD</b> me too</span>

必须用lxml,不只是正则表达式的问题。它需要在标记之前进行一些程序计算。更多类似内容:

s = '<span>Please BOLD me, <br /> BOLD me too</span>'
from lxml import etree
et = etree.fromstring(s)
for e in et.iter():
    if 'BOLD' in e.text:
        **tag it**
    if 'BOLD' in e.tail:
        **tag it**

我想我需要创建一个元素bold = etree.Element('b'); bold.text = 'BOLD'

问题是我不知道如何优雅地插入上面的元素bold

您必须手动创建 <b> 元素并将其 .insert() 到位。将剩余的文本放在创建的元素的 tail 中:

import lxml.html
from lxml.html import builder as E

text = '''
<html>
 <body>
   <span>Please BOLD me</span>
 </body>
</html>
'''

doc = lxml.html.fromstring(text)
for span in doc.xpath('//span'):
    # search for the word "BOLD" in the span text:
    pre, sep, pos = span.text.partition('BOLD')
    if sep:
        span.text = pre
        bold = E.B(sep) # create element
        bold.tail = pos
        span.insert(0, bold)


print(lxml.html.tostring(doc, pretty_print=True))

结果:

<html>
 <body>
   <span>Please <b>BOLD</b> me</span>
 </body>
</html>

如果您在尾部找到它,那么您必须在父元素中插入新元素,就在您找到的元素之后:

parent = element.getparent()
parent.insert(parent.index(element) + 1, bold)