Extract text from HTML, exclude text in <small> tags

Question

I want to extract the text from this HTML, excluding the text inside the <small> tag:

<h1>THE BIG TEXT<small>the small text</small></h1>

I can extract "THE BIG TEXT the small text" with //h1/text(), but how can I extract only "THE BIG TEXT", without "the small text"?

Which XPath do I have to use?

Answer 1

The following XPath should work:

//h1/text()

It selects only the text directly inside the h1 tag, not the text of its child tags. It extracts "THE BIG TEXT".


But if you want to extract all the text inside h1, including the text of its child tags:

//h1//text()

This extracts both text nodes: "THE BIG TEXT" and "the small text".

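Note the difference between the single slash (/) and the double slash (//). A single / selects immediate children only, while // selects all descendants, including nested ones.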
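
As a rough illustration of the / versus // distinction, here is a minimal Python sketch that uses only the standard library. It assumes the snippet is well-formed enough to parse as XML. The XPath subset in xml.etree.ElementTree has no text() node test, so the helpers direct_text and all_text (hypothetical names, not part of any library) emulate //h1/text() and //h1//text() instead of evaluating those expressions directly.

```python
import xml.etree.ElementTree as ET

# The sample markup from the question; assumed to be well-formed XML.
html = "<h1>THE BIG TEXT<small>the small text</small></h1>"
h1 = ET.fromstring(html)


def direct_text(elem):
    """Text nodes that are immediate children of elem (like elem/text())."""
    # elem.text is the text before the first child; each child's tail is
    # the text that follows that child but still belongs to elem.
    parts = [elem.text or ""]
    parts += [child.tail or "" for child in elem]
    return [p for p in parts if p]


def all_text(elem):
    """Every text node below elem at any depth (like elem//text())."""
    return [t for t in elem.itertext() if t]


print(direct_text(h1))  # ['THE BIG TEXT']
print(all_text(h1))     # ['THE BIG TEXT', 'the small text']
```

With a full XPath engine, such as the xpath() method in the third-party lxml package, the two expressions from the answer can be evaluated as written.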
