Keeping certain text and discarding the rest from some elements using a selector

Question

From the HTML element below, how can I keep the text Hi there!! and discard the other text, Cat, using a CSS selector? Also, I get no result with .text or .text.strip(), but I do get the text when I use .text_content().

from lxml.html import fromstring

html="""
<div id="item_type" data-attribute="item_type" class="ms-crm-Inline" aria-describe="item_type_c">
    <div>
        <label for="item_type_outer" id="Type_outer">
            <div class="NotVisible">Cat</div>
        Hi there!!
            <div class="GradientMask"></div>
        </label>
    </div>
</div>
"""
root = fromstring(html)
for item in root.cssselect("#Type_outer"):
    print(item.text)  # doesn't work
    print(item.text.strip()) # doesn't work
    print(item.text_content()) # working one

Result:

Cat 
Hi there!!

However, the only result I want is Hi there!!. To get it, I tried:

root.cssselect("#Type_outer:not(.NotVisible)") #it doesn't work either

To restate the questions:

Why does .text_content() work while .text and .text.strip() do not?
How can I get only Hi there!! using a CSS selector?

Answer 1

In the lxml tree model, the text you want is stored in the tail of the div with class "NotVisible":

>>> root = fromstring(html)
>>> for item in root.cssselect("#Type_outer > div.NotVisible"):
...     print(item.tail.strip())
...
Hi there!!

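So, to answer the first question: an element's text attribute holds only the text that comes before its first child element. A text node that follows a sibling element (like the one in this question) is stored in that element's tail attribute instead. text_content(), by contrast, concatenates all the text inside the element, including the text and tails of its descendants, which is why it printed both Cat and Hi there!!.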
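The split between text and tail can be checked directly. The following is a minimal sketch on a trimmed copy of the question's markup; the variable names are illustrative, and an XPath lookup stands in for cssselect only to avoid the extra dependency:

```python
from lxml.html import fromstring

html = """
<div>
    <label for="item_type_outer" id="Type_outer">
        <div class="NotVisible">Cat</div>
    Hi there!!
        <div class="GradientMask"></div>
    </label>
</div>
"""

root = fromstring(html)
label = root.xpath('.//label[@id="Type_outer"]')[0]
hidden = label[0]  # first child: <div class="NotVisible">

# .text covers only what precedes the first child (just whitespace here)
label_text = (label.text or "").strip()
# the wanted text follows the first child, so it is that child's tail
tail_text = (hidden.tail or "").strip()
# text_content() joins every piece of text found inside the label
all_text = label.text_content()

print(repr(label_text))
print(repr(tail_text))
print("Cat" in all_text and "Hi there!!" in all_text)
```

The `or ""` guards are there because text and tail are None when no text is present at that position.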

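Another way to get the text "Hi there!!" is to query for the non-empty text nodes that are direct children of the label. A query at this level of detail can be done with an XPath expression: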

for item in root.cssselect("#Type_outer"):
    print(item.xpath("text()[normalize-space()]")[0].strip())
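As for the second question, CSS selectors match elements only, never bare text nodes, so no selector can return Hi there!! on its own. The attempt #Type_outer:not(.NotVisible) still matches the whole label, because :not() tests the label's own class rather than its children. One possible workaround, sketched here under the assumption that only the label's direct text is wanted, is to assemble it from the label's text and its children's tail values. The helper own_text is a hypothetical name, not part of lxml:

```python
from lxml.html import fromstring

html = """
<div>
    <label for="item_type_outer" id="Type_outer">
        <div class="NotVisible">Cat</div>
    Hi there!!
        <div class="GradientMask"></div>
    </label>
</div>
"""

def own_text(element):
    """Return the text directly inside element, ignoring child elements' content."""
    # text before the first child, then the text that follows each child
    parts = [element.text or ""]
    parts.extend(child.tail or "" for child in element)
    # collapse runs of whitespace left over from the markup's indentation
    return " ".join("".join(parts).split())

root = fromstring(html)
label = root.xpath('.//label[@id="Type_outer"]')[0]
result = own_text(label)
print(result)
```

Unlike indexing the first XPath result, this sketch does not fail when the label has no direct text; it simply returns an empty string.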

python

lxml

css-selectors

web-scraping

python-3.x