How to determine a logical section when crawling with Python's Beautiful Soup

Question

So far, the pages I scrape have always had this structure:

<h2 class="dot">headline 1</h2>
<p>text</p>
<h2 class="dot">headline 2</h2>
<p>text</p>

But some of the sites I scrape may have the following structure:

<h2 class="dot">headline 1</h2>
<p>text</p>
<p>text</p>
<h2 class="dot">headline 2</h2>
<p>text</p>

This is how I scrape:

for product in soup.findAll("p"):

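I can't see any way to tell whether different p elements belong together. Does anyone know how I can determine whether one or two p elements belong to the same logical unit?

One possible approach would be to check whether the preceding HTML element is a p or an h2. Is there a good way to find that out?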

Answer 1

Here you go:

from bs4 import BeautifulSoup

html="""
<div>
<h2 class="dot">headline 1</h2>
<p>text</p>
<p>text</p>
<h2 class="dot">headline 2</h2>
<p>text</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

for h2 in soup.find_all("h2"):
    group = []
    node = h2.next_sibling

    # Collect siblings until we run out of them or reach the next h2
    while node is not None and node.name != "h2":
        group.append(node)
        node = node.next_sibling

    # Do whatever you want with the group
    print(group)

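What this does is iterate over all h2 elements, walk through each one's following siblings, and append them to a list until it runs out of siblings or finds another h2. If you only want the <p> elements, change: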

group.append(node)

to:

if node.name == "p":
    group.append(node)
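
The filter matters because the siblings of an h2 are not only tags. The sketch below assumes the built-in "html.parser" backend and a shortened copy of the sample markup; with that setup, the line breaks between tags show up as text nodes among the siblings. The names `siblings`, `tags` and `text_nodes` are illustrative only.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

html = """
<div>
<h2 class="dot">headline 1</h2>
<p>text</p>
<p>text</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
h2 = soup.find("h2")

# Everything that follows the h2 under the same parent
siblings = list(h2.next_siblings)

# Separate real elements from the whitespace strings between them
tags = [s for s in siblings if isinstance(s, Tag)]
text_nodes = [s for s in siblings if isinstance(s, NavigableString)]

print(len(siblings), "siblings in total")
print(len(tags), "tags:", [t.name for t in tags])
print(len(text_nodes), "text nodes")
```

Checking `isinstance(node, Tag)` is an alternative to comparing `node.name` when you want to skip the whitespace strings explicitly.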

Oh, and one last comment: unless you really need a list, it's better to do whatever you want inside the loop instead of adding the nodes to a list, like this:

from bs4 import BeautifulSoup

html="""
<div>
<h2 class="dot">headline 1</h2>
<p>text</p>
<p>text</p>
<h2 class="dot">headline 2</h2>
<p>text</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

for h2 in soup.find_all("h2"):
    node = h2.next_sibling

    print("This h2", h2)

    # Walk the siblings until we run out of them or reach the next h2
    while node is not None and node.name != "h2":
        if node.name == "p":
            print(node)
        node = node.next_sibling

Output:

This h2 <h2 class="dot">headline 1</h2>
<p>text</p>
<p>text</p>
This h2 <h2 class="dot">headline 2</h2>
<p>text</p>
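
The idea raised in the question, looking at what precedes each p, can also be made to work. The following is a minimal sketch, assuming every headline and paragraph sits under the same parent as in the sample markup. It uses Beautiful Soup's `find_previous_sibling` to look up the nearest earlier h2 for each p. The `sections` dictionary and the distinct paragraph texts are illustrative and not part of the original post.

```python
from bs4 import BeautifulSoup

html = """
<div>
<h2 class="dot">headline 1</h2>
<p>text a</p>
<p>text b</p>
<h2 class="dot">headline 2</h2>
<p>text c</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Map each headline's text to the paragraphs that follow it
sections = {}
for p in soup.find_all("p"):
    # Nearest earlier sibling that is an h2, or None if there is none
    heading = p.find_previous_sibling("h2")
    key = heading.get_text(strip=True) if heading is not None else None
    sections.setdefault(key, []).append(p.get_text(strip=True))

for title, paragraphs in sections.items():
    print(title, paragraphs)
```

Paragraphs that appear before the first h2 end up under the `None` key. Two headlines with identical text would be merged in this sketch, so keying on the h2 tag itself may be safer on real pages.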

python

beautifulsoup

web-crawler