Select the entire text from the following node, including child nodes, using an XPath query in Python

Question

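I want to extract, in Python with XPath, the content of the nodes that follow an a tag. So far I have managed to extract content that has no inner tags. The problem is that my approach does not work when one of the following nodes contains a child node. I am using the lxml package, and this is my code: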

from lxml.html import fromstring

# `html` is assumed to hold the page source as a string
root = fromstring(html)

reference_titles = root.xpath("//table[@id='vulnrefstable']/tr/td")
for tree in reference_titles:
    a_tag = tree.xpath('a/@href')[0]
    title = tree.xpath('a/following-sibling::text()')

This works for the following HTML:

<tr>

    <td class="r_average">

        <a href="http://somelink.com" target="_blank" title="External url">
            http://somelink.com
        </a>
        <br/> SECUNIA 27633                     
    </td>

</tr>

Here the title is correctly "SECUNIA 27633", but with this HTML:

<tr>

    <td class="r_average">

        <a href="http://somelink.com" target="_blank" title="External url">
            http://somelink.com
        </a>
        <br/> SECUNIA 27633     <i>Release Date:</i> tomorrow               
    </td>

</tr>

the result is "SECUNIA 27633 tomorrow".

How can I extract "SECUNIA 27633 Release Date: tomorrow"?

Edit: Using node() instead of text() in the XPath returns all nodes, elements as well as text. So I used that and built the final string with nested for statements:

title = tree.xpath('a/following-sibling::node()')

But I would like to know whether there is a better way to simply extract the text content with an XPath query, regardless of child nodes.
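For reference, the node() approach described in the edit might look roughly like the sketch below. The original loop is not shown in the question, so the helper name `join_following_nodes` and the sample document are assumptions made for illustration. It relies on lxml returning text nodes as str subclasses and elements as objects with a `text_content()` method.

```python
from lxml.html import fromstring

# Sample document built from the question's second snippet (wrapper tags are assumed).
SAMPLE_PAGE = """
<html><body>
<table id="vulnrefstable">
  <tr>
    <td class="r_average">
      <a href="http://somelink.com" target="_blank" title="External url">
        http://somelink.com
      </a>
      <br/> SECUNIA 27633     <i>Release Date:</i> tomorrow
    </td>
  </tr>
</table>
</body></html>
"""


def join_following_nodes(td):
    """Hypothetical helper: concatenate everything that follows the <a> inside td."""
    parts = []
    for node in td.xpath("a/following-sibling::node()"):
        if isinstance(node, str):
            # Text nodes come back as string results.
            parts.append(node)
        else:
            # Elements: take their own text plus that of their descendants.
            parts.append(node.text_content())
    # Collapse runs of whitespace left over from the markup's indentation.
    return " ".join("".join(parts).split())


page = fromstring(SAMPLE_PAGE)
node_titles = [
    join_following_nodes(td)
    for td in page.xpath("//table[@id='vulnrefstable']/tr/td")
]
print(node_titles)
```

On the sample markup this should yield a single title containing the text of the i element as well as the surrounding text.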

Answer 1

Try this:

for tree in reference_titles:
    a_tag = tree.xpath('a/@href')[0]
    title = " ".join([node.strip() for node in tree.xpath('.//text()[not(parent::a)]') if node.strip()])
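The expression `.//text()[not(parent::a)]` selects every text node under the td at any depth, except those whose parent is the a element, so the text inside i is included while the link text is skipped. Below is a self-contained sketch of the same idea run against the second snippet from the question. The wrapper document, the function name `extract_references`, and the extra whitespace normalization are assumptions added for illustration and are not part of the answer above.

```python
from lxml.html import fromstring

# Sample document built from the question's second snippet (wrapper tags are assumed).
SAMPLE_HTML = """
<html><body>
<table id="vulnrefstable">
  <tr>
    <td class="r_average">
      <a href="http://somelink.com" target="_blank" title="External url">
        http://somelink.com
      </a>
      <br/> SECUNIA 27633     <i>Release Date:</i> tomorrow
    </td>
  </tr>
</table>
</body></html>
"""


def extract_references(root):
    """Hypothetical helper: return (href, title) pairs for each reference cell."""
    results = []
    for td in root.xpath("//table[@id='vulnrefstable']/tr/td"):
        href = td.xpath("a/@href")[0]
        # All descendant text nodes of the cell, except the link's own text.
        texts = td.xpath(".//text()[not(parent::a)]")
        # Drop whitespace-only nodes and collapse inner runs of whitespace.
        title = " ".join(" ".join(t.split()) for t in texts if t.strip())
        results.append((href, title))
    return results


references = extract_references(fromstring(SAMPLE_HTML))
print(references)
```

If the cell held more than one link, `a/@href` would return several values and the `[0]` index would keep only the first, so this sketch assumes one link per cell, as in the question.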
