How to exclude text anchored by specific tags from lxml text_content()
I know there are similar questions, but since none of them solved my problem, please bear with me for bringing this up again.
This is my string:
normal = """
<p>
<b>
<a href='link1'> Forget me </a>
</b> I need this one <br>
<b>
<a href='link2'> Forget me too </a>
</b> Forget me not <i>even when</i> you go to sleep <br>
<b> <a href='link3'> Forget me three </a>
</b> Foremost on your mind <br>
</p>
"""
I start with:
import lxml.html
from lxml import etree

target = lxml.html.fromstring(normal)
tree_struct = etree.ElementTree(target)
Now, I basically need to ignore everything anchored by <a> tags. But if I run this code:
for e in target.iter():
    item = target.xpath(tree_struct.getpath(e))
    if len(item) > 0:
        print(item[0].text)
I get nothing useful; on the other hand, if I change the print statement to:
print(item[0].text_content())
I get this output:
Forget me
I need this one
Forget me too
Forget me not
even when
you go to sleep
Forget me three
Foremost on your mind
whereas the output I want is:
I need this one
Forget me not
even when
you go to sleep
Foremost on your mind
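Besides giving the wrong output, it is also inelegant. So I must be missing something obvious, though I can't tell what.
I think you are making this unnecessarily complicated. There is no need to create the tree_struct object and use getpath(). Here is a suggestion: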
from lxml import html
normal = """
<p>
<b>
<a href='link1'> Forget me </a>
</b> I need this one <br>
<b>
<a href='link2'> Forget me too </a>
</b> Forget me not <i>even when</i> you go to sleep <br>
<b> <a href='link3'> Forget me three </a>
</b> Foremost on your mind <br>
</p>
"""
target = html.fromstring(normal)
for e in target.iter():
    # Skip <a> elements themselves; their own text and tail are ignored
    if e.tag != "a":
        # Print text content if not only whitespace
        if e.text and e.text.strip():
            print(e.text.strip())
        # Print tail content if not only whitespace
        if e.tail and e.tail.strip():
            print(e.tail.strip())
Output:
I need this one
Forget me not
even when
you go to sleep
Foremost on your mind
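As an alternative to filtering inside the loop, the same result can probably be had with a single XPath expression that selects only the text nodes with no <a> ancestor. This is a minimal sketch, not the only way to do it; the helper name text_outside_links is made up for illustration, and it assumes the input is the same fragment as in the question.

```python
from lxml import html

normal = """
<p>
<b>
<a href='link1'> Forget me </a>
</b> I need this one <br>
<b>
<a href='link2'> Forget me too </a>
</b> Forget me not <i>even when</i> you go to sleep <br>
<b> <a href='link3'> Forget me three </a>
</b> Foremost on your mind <br>
</p>
"""


def text_outside_links(root):
    """Hypothetical helper: return stripped, non-empty text nodes
    that are not located inside an <a> element."""
    # text() matches every text node below root, in document order;
    # the predicate drops those that have an <a> element as an ancestor.
    nodes = root.xpath(".//text()[not(ancestor::a)]")
    return [t.strip() for t in nodes if t.strip()]


target = html.fromstring(normal)
for line in text_outside_links(target):
    print(line)
```

One difference from the loop above: because the test is on ancestors rather than on the element's own tag, text in elements nested inside a link (say a <span> within an <a>) is excluded as well, while text that follows the closing </a> tag is kept.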