How to exclude text anchored by specific tags from lxml text_content()

Question

I know there are similar questions, but none of them solved my problem, so please bear with me for raising this again.

This is my string:

normal = """
  <p>
    <b>
      <a href='link1'>        Forget me  </a>
    </b>     I need this one      <br>
    <b>
     <a href='link2'>  Forget me too  </a>
    </b> Forget me not <i>even when</i> you go to sleep <br>
    <b>  <a href='link3'>  Forget me three  </a>
    </b>  Foremost on your mind <br>
   </p>    
"""

I start with:

import lxml.html
from lxml import etree
target = lxml.html.fromstring(normal)
tree_struct = etree.ElementTree(target)

Now, I basically need to ignore everything anchored by the <a> tags. But if I run this code:

for e in target.iter():
   item = target.xpath(tree_struct.getpath(e))
   if len(item)>0:
       print(item[0].text)

I get nothing; on the other hand, if I change the print statement to:

  print(item[0].text_content())

I get this output:

Forget me
 I need this one

 Forget me too

Forget me not
even when
you go to sleep


 Forget me three

Foremost on your mind

While the output I want is:

 I need this one

Forget me not
even when
you go to sleep    

Foremost on your mind

Besides giving the wrong output, this approach is also inelegant. So I must be missing something obvious, though I can't tell what.

Answer 1

I think you're making this unnecessarily complicated. There is no need to create the tree_struct object and use getpath(). Here is a suggestion:

from lxml import html

normal = """
  <p>
    <b>
      <a href='link1'>        Forget me  </a>
    </b>     I need this one      <br>
    <b>
     <a href='link2'>  Forget me too  </a>
    </b> Forget me not <i>even when</i> you go to sleep <br>
    <b>  <a href='link3'>  Forget me three  </a>
    </b>  Foremost on your mind <br>
   </p>
"""

target = html.fromstring(normal)

for e in target.iter():
    if e.tag != "a":
        # Print text content if not only whitespace
        if e.text and e.text.strip():
            print(e.text.strip())
        # Print tail content if not only whitespace
        if e.tail and e.tail.strip():
            print(e.tail.strip())

Output:

I need this one
Forget me not
even when
you go to sleep
Foremost on your mind
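
One caveat with the loop above: skipping an <a> element also skips its tail, which is the text that follows the closing </a> and is not part of the link. The sample has only whitespace there, so nothing is lost, but other markup might. Since the question is tagged xpath, here is a minimal sketch of an alternative that selects text nodes directly and leaves out only those nested inside the unwanted tag. The helper name text_outside, its skip_tag parameter, and the modified sample string are made up for this illustration.

```python
from lxml import html


def text_outside(root, skip_tag="a"):
    """Yield stripped text pieces under root, leaving out text inside skip_tag.

    Hypothetical helper for illustration; not part of lxml.
    """
    # The predicate drops every text node that has a skip_tag ancestor.
    # Text following the closing tag (the tail) has no such ancestor, so it is kept.
    for piece in root.xpath(".//text()[not(ancestor::%s)]" % skip_tag):
        piece = piece.strip()
        if piece:  # ignore whitespace-only nodes
            yield piece


# Shortened variant of the question's string, with text added after the first </a>.
sample = """
  <p>
    <b><a href='link1'> Forget me </a> kept tail</b> I need this one <br>
    <b><a href='link2'> Forget me too </a></b> Forget me not <i>even when</i> you go to sleep <br>
  </p>
"""

result = list(text_outside(html.fromstring(sample)))
print(result)
# ['kept tail', 'I need this one', 'Forget me not', 'even when', 'you go to sleep']
```

On the original string from the question this should give the same five lines as the loop above. The difference only shows up when text sits between </a> and the end of its parent element.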

python

xpath

lxml