Scrape values inside span class webpage with beautifulsoup python

Question

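Hi all, I have a webpage that I am trying to scrape. The page has a large number of span classes, most of which hold useless information. I have posted the part of the span class data that I need, but I can't simply call find_all on the spans because there are 100 others that I don't need.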

            <div class="col-md-4">
                <p>
                  <span class="text-muted">File Number</span><br>
                  A-21-897274
                </p>
            </div>
            <div class="col-md-4">
              <p>
                <span class="text-muted">Location</span><br>
                Ohio
              </p>
            </div>
              <div class="col-md-4">
                <p>
                  <span class="text-muted">Date</span><br>
                  07/01/2022
                </p>
              </div>
          </div>

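I need the span titles:
File Number, Location, Date

and the matching values:
"A-21-897274", "Ohio", "07/01/2022"

I need to print these out so I can build a pandas DataFrame, but I can't seem to print only those specific spans together with their values.

What I have tried: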

import bs4
from bs4 import BeautifulSoup

soup = BeautifulSoup(..., 'lxml')

for title_tag in soup.find_all('span', class_='text-muted'):
    # get the last sibling
    *_, value_tag = title_tag.next_siblings

    title = title_tag.text.strip()

    if isinstance(value_tag, bs4.element.Tag):
        value = value_tag.text.strip()
    else:  # it's a navigable string element
        value = value_tag.strip()

    print(title, value)

Output:

File Number "A-21-897274"
Location "Ohio"
Operations_Manager "Joanna"
Date "07/01/2022"
Type "Transfer"
Status "Open"
ETC "ETC"
ETC "ETC"

This prints everything I need, but it also prints about 100 other values that I don't want or need.

Answer 1

You can pass a function to soup.find_all to select only the elements you need, and then use .find_next_sibling() to select the value. For example:

from bs4 import BeautifulSoup


html_doc = """
<div class="col-md-4">
    <p>
      <span class="text-muted">File Number</span><br>
      A-21-897274
    </p>
</div>
<div class="col-md-4">
  <p>
    <span class="text-muted">Location</span><br>
    Ohio
  </p>
</div>
  <div class="col-md-4">
    <p>
      <span class="text-muted">Date</span><br>
      07/01/2022
    </p>
  </div>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")


def correct_tag(tag):
    return tag.name == "span" and tag.get_text(strip=True) in {
        "File Number",
        "Location",
        "Date",
    }


for t in soup.find_all(correct_tag):
    # string=True matches the next text-node sibling (text= is the older, deprecated spelling)
    print(f"{t.text}: {t.find_next_sibling(string=True).strip()}")

Prints:

File Number: A-21-897274
Location: Ohio
Date: 07/01/2022
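
Since the goal is a pandas DataFrame, the same filtering idea can feed a dictionary instead of print. The following is only a minimal sketch: the helper name extract_record, the WANTED tuple, and the shortened sample HTML (including the extra "Status" block standing in for the unwanted spans) are assumptions made for illustration, and it treats the page as a single record.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Shortened sample markup (an assumption); "Status" stands in for unwanted spans.
html_doc = """
<div class="col-md-4"><p><span class="text-muted">File Number</span><br>A-21-897274</p></div>
<div class="col-md-4"><p><span class="text-muted">Location</span><br>Ohio</p></div>
<div class="col-md-4"><p><span class="text-muted">Status</span><br>Open</p></div>
<div class="col-md-4"><p><span class="text-muted">Date</span><br>07/01/2022</p></div>
"""

# Titles to keep; every other span is skipped.
WANTED = ("File Number", "Location", "Date")


def extract_record(html, wanted=WANTED):
    """Return {title: value} for the wanted span titles only (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    record = {}
    for span in soup.find_all("span", class_="text-muted"):
        title = span.get_text(strip=True)
        if title not in wanted:
            continue
        # The value is the first text node that follows the span (after the <br>).
        value = span.find_next_sibling(string=True)
        record[title] = value.strip() if value is not None else None
    return record


record = extract_record(html_doc)
# One row per page; columns follow the order given in WANTED.
df = pd.DataFrame([record], columns=list(WANTED))
print(df)
```

If several pages are scraped, collecting one such dictionary per page in a list and passing that list to pd.DataFrame should give one row per page.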

python

beautifulsoup

web-scraping

pandas