How to get the text enclosed within a tag, which contains multiple sub-tags, with BeautifulSoup?

Question

I am trying to scrape a web page that has the following tags:

  <div style="text-align: center;">
            <img src="https://documents.google.com/" alt="" width="60" height="30" />
            <br />
            Pick me please.

        <p> Do not pick me please! </p>

        <br />
        <br />
    </div>

I want to scrape the "Pick me please." string, but not the "Do not pick me please!" string. Any idea how to do this?

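Edit: I would like a more general solution. I always want the text that sits directly under a specific tag and is not inside any of its child tags.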

Answer 1

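You can also use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string. Here I use find() with a regular expression (re.compile) to match only the text I want.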

import re
from bs4 import BeautifulSoup
html= """<div style="text-align: center;">
            <img src="https://documents.google.com/" alt="" width="60" height="30" />
            <br />
            Pick me please.

        <p> Do not pick me please! </p>

        <br />
        <br />
    </div>"""

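# Parse the markup, then find the first text node matching the pattern.
# The dot is escaped so it matches a literal "." rather than any character.
soup = BeautifulSoup(html, 'lxml')
print(soup.find(text=re.compile(r"Pick me please\.")).strip())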
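To make the difference between the two calls concrete, here is a minimal sketch that runs both on a shortened copy of the markup. It assumes a Beautiful Soup version that accepts the string= argument (the newer name for text=) and uses the built-in "html.parser" so that lxml is not required; the variable names are only for illustration.

```python
import re
from bs4 import BeautifulSoup

html = """<div style="text-align: center;">
    <img src="https://documents.google.com/" alt="" width="60" height="30" />
    <br />
    Pick me please.
    <p> Do not pick me please! </p>
    <br />
</div>"""

soup = BeautifulSoup(html, "html.parser")

# get_text() joins every text node under the tag, including the <p> text
all_text = soup.div.get_text(" ", strip=True)

# find() with a compiled pattern returns only the first matching text node
picked = soup.find(string=re.compile(r"Pick me please\.")).strip()

print(all_text)  # Pick me please. Do not pick me please!
print(picked)    # Pick me please.
```

Note that the regular expression approach only works when you already know part of the text you are looking for.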

Answer 2

Edit

A more "generic" solution that uses find() to get the first non-empty text node directly inside the div:

parent = soup.select_one('div')
parent.find(text=lambda text: text and text.strip(), recursive=False).strip()
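If the tag may hold several direct text nodes, the same idea can be extended to collect all of them. The following is only a sketch: direct_text is a hypothetical helper name, not part of Beautiful Soup, and the markup is a reduced stand-in for the page in the question.

```python
from bs4 import BeautifulSoup, Comment, NavigableString


def direct_text(tag):
    """Return stripped, non-empty text nodes that are direct children of tag."""
    return [
        child.strip()
        for child in tag.children
        # Keep plain text nodes only; skip child tags and HTML comments
        if isinstance(child, NavigableString)
        and not isinstance(child, Comment)
        and child.strip()
    ]


html = """
<div>
  first
  <p>nested</p>
  second
</div>
"""
soup = BeautifulSoup(html, "html.parser")

print(direct_text(soup.div))  # ['first', 'second']
```

Iterating over .children never descends into child tags, so the text inside the p element is left out.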

To get the text node, use previous_sibling and, to get rid of the surrounding whitespace, strip() the result.

soup.select_one('div p').previous_sibling.strip()

Or use get_text() with a separator and strip=True, then split:

soup.select_one('div').get_text('|', strip=True).split('|')[0]
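One caveat with splitting on a separator is that it breaks if the text itself contains that character. A possible alternative, sketched below under the assumption that you want the first non-empty string anywhere under the tag, is the stripped_strings generator. The markup here is made up to include a "|" on purpose.

```python
from bs4 import BeautifulSoup

html = "<div><br/> Pick | me please. <p> Do not pick me please! </p></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

# Splitting on the separator cuts the text short when it contains "|"
via_split = div.get_text("|", strip=True).split("|")[0]

# stripped_strings yields each non-empty text node, already stripped
via_generator = next(div.stripped_strings, None)

print(repr(via_split))      # 'Pick '
print(repr(via_generator))  # 'Pick | me please.'
```

Both variants look at nested text as well, so they return the wanted string only when it comes before any child tag that contains text.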

Minimal example

from bs4 import BeautifulSoup

html = '''
<div style="text-align: center;">
            <img src="https://documents.google.com/" alt="" width="60" height="30" />
            <br />
            Pick me please.

        <p> Do not pick me please! </p>

        <br />
        <br />
    </div>
'''
soup = BeautifulSoup(html, 'lxml')

soup.select_one('div p').previous_sibling.strip()

Output

Pick me please.

Answer 3

You can search for the <br> tag and then call the find_next() method, which returns the first match that follows it.

soup = BeautifulSoup(html, "html.parser")

print(soup.select_one('div br').find_next(text=True).strip())

Output:

Pick me please.
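Keep in mind that find_next() walks the document in order and also descends into child tags, so this approach depends on the wanted text coming right after the first <br>. The sketch below illustrates this with two made-up variants of the markup; text_after_first_br is a hypothetical helper, and string= is assumed to be available as the newer name for text=.

```python
from bs4 import BeautifulSoup

# Variant A: the wanted text directly follows the <br>
html_a = "<div><br/> Pick me please. <p> Do not pick me please! </p></div>"
# Variant B: a <p> comes between the <br> and the wanted text
html_b = "<div><br/><p> Do not pick me please! </p> Pick me please. </div>"


def text_after_first_br(markup):
    """Return the first text node that follows the first <br> in the div."""
    soup = BeautifulSoup(markup, "html.parser")
    return soup.div.find("br").find_next(string=True).strip()


print(text_after_first_br(html_a))  # Pick me please.
print(text_after_first_br(html_b))  # Do not pick me please!
```

For variant B, a lookup restricted to direct children (recursive=False) would be the safer choice.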
