Bsoup4: extracting child elements that are not wrapped by a parent element

Question

Context

This post assumes the following context:

python 2.7
bsoup4
scraping content from non-wrapped (adjacent) elements

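Problem

Goal

Trevor wants to extract page content where the related content is not wrapped in a single containing element, but instead sits adjacent to a header element.
In the example below, Trevor wants a python data structure with four elements, each containing a 'header' name-value pair and a 'body' name-value pair.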

Details

The best way to explain is by example:

<h2>Alpha blurb</h2>

* content here one
* content here two

<h2>Bravo blurb</h2>

* content here one
* content here two
* content here tree
* content here four
* content here fyve
* content here seeks

<h2>Charlie blurb</h2>

* content here four
* content here fyve
* content here seeks

<h2>Delta blurb</h2>

* blah

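From what Trevor has seen so far, the usual Bsoup strategy for scraping content is to find container elements, iterate over them, and drill down into them.

In this case, however, Trevor wants to extract each header item together with its associated content, even though that content is not enclosed in a containing element.

The only indication of where one content section ends and the next begins is the position of the header tags.

Question

Where in the bsoup4 documentation should Trevor look, or what term can he search for, to generalize this principle and get the result he is after?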

Answer 1

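Trevor needs to go sideways here and use .next_siblings (the BeautifulSoup documentation covers this under "Going sideways" in "Navigating the tree"). Example: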

from bs4 import BeautifulSoup


page = """
<div>
<h2>Alpha blurb</h2>

* content here one
* content here two

<h2>Bravo blurb</h2>

* content here one
* content here two
* content here tree
* content here four
* content here fyve
* content here seeks

<h2>Charlie blurb</h2>

* content here four
* content here fyve
* content here seeks

<h2>Delta blurb</h2>

* blah
</div>
"""
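# name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(page, "html.parser")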

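for h2 in soup.find_all("h2"):

    # parenthesized single-argument print works on python 2.7 and python 3
    print(h2.text)

    # loop over siblings until the next h2 is met (or no more siblings are left)
    for item in h2.next_siblings:
        if item.name == "h2":
            break

        # every sibling in this page is a text node, so strip() is safe here
        print(item.strip())

    print("----")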

Prints:

Alpha blurb
* content here one
* content here two
----
Bravo blurb
* content here one
* content here two
* content here tree
* content here four
* content here fyve
* content here seeks
----
Charlie blurb
* content here four
* content here fyve
* content here seeks
----
Delta blurb
* blah
----
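
The loop above only prints the sections, while the question asks for a data structure with a 'header' and a 'body' entry per section. A minimal sketch of that, built on the same .next_siblings idea, might look like the following. The function name collect_sections and its header_name parameter are made up for illustration, and the choice to store the body as a single newline-joined string is an assumption; a list of lines would work just as well. Unlike the loop above, this version also tolerates sibling tags other than h2 by taking their text.

```python
from bs4 import BeautifulSoup, Tag

page = """
<div>
<h2>Alpha blurb</h2>

* content here one
* content here two

<h2>Bravo blurb</h2>

* content here one
* content here two
* content here tree
* content here four
* content here fyve
* content here seeks

<h2>Charlie blurb</h2>

* content here four
* content here fyve
* content here seeks

<h2>Delta blurb</h2>

* blah
</div>
"""


def collect_sections(html, header_name="h2"):
    """Return one {'header': ..., 'body': ...} dict per header tag.

    Hypothetical helper: the name and the dict layout are assumptions
    based on the data structure described in the question.
    """
    soup = BeautifulSoup(html, "html.parser")
    sections = []
    for header in soup.find_all(header_name):
        parts = []
        # walk sideways until the next header (or until siblings run out)
        for sibling in header.next_siblings:
            if isinstance(sibling, Tag):
                if sibling.name == header_name:
                    break
                text = sibling.get_text()
            else:
                # plain text node sitting between two tags
                text = sibling
            text = text.strip()
            if text:
                parts.append(text)
        sections.append({
            "header": header.get_text().strip(),
            "body": "\n".join(parts),
        })
    return sections


sections = collect_sections(page)
for section in sections:
    print(section["header"])
    print(section["body"])
    print("----")
```

For the sample page this yields a list of four dicts, one per h2, so sections[0]["header"] is "Alpha blurb" and sections[3]["body"] is "* blah".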
