How to select a tag by the content of the tag before it?

Question

My HTML page looks like this:

<div>
<h1>First Item</h1>
<p> the text I want </p>
</div>

<div>
<h1>Second Item</h1>
<p> the text I don't want </p>
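</div>

The "First Item" heading may sit at a different tag level on each page I scrape, so its index is not fixed.

I want a selection that looks something like this (pseudo-code):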

from lxml import html

locate_position = locate(html.xpath(//div/h1[contains("First Item")])))

scrape = html.xpath(//div[locate_position]/p)

Answer 1

If you only want to match on the preceding sibling:

//p[preceding-sibling::h1[contains(., "First Item")]]

An option closer to your example is:

//div[contains(h1, "First Item")]/p

This returns the p that is a child of a div which has an h1 child containing "First Item".
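
As a rough illustration, both expressions can be run with lxml against the markup from the question. This is only a sketch: the PAGE string, the variable names, and the added html/body wrapper are assumptions for the demo, and the expressions start with // so that the elements are matched anywhere in the tree rather than only at the root.

```python
from lxml import html

# Sample markup from the question, wrapped in html/body for the demo
PAGE = """
<html><body>
<div>
<h1>First Item</h1>
<p> the text I want </p>
</div>

<div>
<h1>Second Item</h1>
<p> the text I don't want </p>
</div>
</body></html>
"""

tree = html.fromstring(PAGE)

# p children of any div whose h1 child contains the search string
by_parent = tree.xpath('//div[contains(h1, "First Item")]/p/text()')

# p elements that have a preceding h1 sibling containing the search string
by_sibling = tree.xpath(
    '//p[preceding-sibling::h1[contains(., "First Item")]]/text()'
)

print([text.strip() for text in by_parent])
print([text.strip() for text in by_sibling])
```

Both queries pick out only the paragraph under "First Item". The first form ties the p to its parent div, while the second only requires a matching h1 somewhere before the p among its siblings.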

Answer 2

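If you are open to using bs4 4.7.1, this is easy. You can use the :contains pseudo-class to specify that the h1 must contain the search string, and the adjacent sibling combinator to specify that the match must be the p tag that immediately follows it.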

The adjacent sibling combinator (+) separates two selectors and matches the second element only if it immediately follows the first element, and both are children of the same parent element.

from bs4 import BeautifulSoup as bs

html = '''
<div>
<h1>First Item</h1>
<p> the text I want </p>
</div>

<div>
<h1>Second Item</h1>
<p> the text I don't want </p>
</div>
'''

soup = bs(html, 'lxml')

# multiple matches possible
matches = [match.text for match in soup.select('h1:contains("First Item") + p')]
print(matches)

# first match (useful if only one match is expected or only the first is required)
print(soup.select_one('h1:contains("First Item") + p').text)
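
In more recent releases of Soup Sieve (the selector library behind select), :contains was renamed to :-soup-contains and the old spelling is deprecated. Here is a minimal sketch of the same lookup, assuming Soup Sieve 2.1 or later is installed. It uses the built-in html.parser instead of lxml only to keep the example light on dependencies, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

page = '''
<div>
<h1>First Item</h1>
<p> the text I want </p>
</div>

<div>
<h1>Second Item</h1>
<p> the text I don't want </p>
</div>
'''

soup = BeautifulSoup(page, 'html.parser')

# h1 containing the search string, followed immediately by a sibling p
selector = 'h1:-soup-contains("First Item") + p'

# get_text(strip=True) drops the surrounding whitespace inside the p
wanted = [p.get_text(strip=True) for p in soup.select(selector)]
print(wanted)
```

If the installed version predates the rename, the :contains spelling used in the answer above is the one to keep.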
