Extract text from a custom <h2> inside <div> elements with BeautifulSoup

Question

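Hello, I am trying to extract names from <h2> elements, but I get an error because names are also being picked up from other <h2> elements. I want to extract the name only from the <h2> specified here:

<div class="poap serp-container lawyer"><div class="gray_border"><div class="col-lg-8 col-md-8 col-sm-9 col-xs-8 text_container"><h2 class="indigo_text">Hi My name is Mark</h2></div></div></div>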

import requests
import csv
from bs4 import BeautifulSoup
from itertools import zip_longest
name = []
page_num = 1
phone = []
logo = []
website = []
links = []
while True:
    try:
        result = requests.get(f"https://attorneys.superlawyers.com/motor-vehicle-accidents/texas/houston/page{page_num}/")
        src = result.content
        soup = BeautifulSoup(src, "lxml")
        page_limit = int("126")
        if(page_num > page_limit // 20):
            print("page ended, terminate")
            break
        names = soup.find_all("h2", {"class":"indigo_text"})
        for i in range(len(names)) :
            name.append(names[i].text.strip())
            links.append(names[i].find("a").attrs["href"])
        for link in links:
            result = requests.get(link)
            src = result.content
            soup = BeautifulSoup(src, "lxml")
            phones = soup.find("a", {"class":"profile-phone-header profile-contact-btn"})
            phone.append(phones["href"])
            logos = soup.find("div", {"class":"photo-container"})
            logo.append(logos.find('img')['src'])
            websites = soup.find("a", {"class":"profile-website-header","id":"firm_website"})
            website.append(websites.text.strip())
        page_num +=1
        print("page switched")
    except:
        print("error")
        break
file_list = [name, phone, website, logo]
exported = zip_longest(*file_list)
with open("/Users/dsoky/Desktop/fonts/Moaaz.csv", "w") as myfile:
    wr = csv.writer(myfile)
    wr.writerow(["name","phone","website","logo"])
    wr.writerows(exported)

I hope someone can help me solve this problem.

Answer 1

Select your tag more specifically, for example with the following CSS selector:

names = soup.select('div.poap h2')

Or with all of the classes:

names = soup.select('div.poap.serp-container.lawyer h2.indigo_text')
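
To illustrate why the narrower selector helps, here is a minimal offline sketch. The markup is a made-up stand-in for the real page (the `sidebar` block and the `example.com` link are assumptions, not taken from the site), and it uses the built-in `html.parser` so it runs without `lxml`:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one <h2> inside the lawyer card and one unrelated
# <h2> elsewhere on the page that happens to share the same class.
html = """
<div class="poap serp-container lawyer">
  <div class="gray_border">
    <div class="col-lg-8 col-md-8 col-sm-9 col-xs-8 text_container">
      <h2 class="indigo_text"><a href="https://example.com/mark">Hi My name is Mark</a></h2>
    </div>
  </div>
</div>
<div class="sidebar">
  <h2 class="indigo_text">Sponsored listing</h2>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Broad search, as in the question: matches every <h2 class="indigo_text">.
broad = [h2.get_text(strip=True)
         for h2 in soup.find_all("h2", {"class": "indigo_text"})]

# Narrow search: only <h2> elements nested inside the lawyer card <div>.
narrow = [h2.get_text(strip=True)
          for h2 in soup.select("div.poap.serp-container.lawyer h2.indigo_text")]

print(broad)
print(narrow)
```

The broad search returns both headings, while the narrowed selector returns only the one inside the `div.poap.serp-container.lawyer` card.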

Note: This answer only addresses the main point of the question; the code could be improved to avoid some side effects.
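
The answer does not say which side effects it has in mind. Two plausible candidates in the question's code are that `names[i].find("a")` returns `None` for a heading without a link (which raises an `AttributeError`), and that names and links are kept in separate lists that can drift out of step. The sketch below is one possible way to handle both, not the answerer's method; `extract_lawyers` and the sample markup are hypothetical names introduced only for illustration:

```python
from bs4 import BeautifulSoup

def extract_lawyers(page_html):
    """Return (name, href) pairs for the lawyer cards on one results page.

    Hypothetical helper: headings without a usable link are skipped, so
    each name always stays paired with its own profile URL.
    """
    soup = BeautifulSoup(page_html, "html.parser")
    pairs = []
    for h2 in soup.select("div.poap.serp-container.lawyer h2.indigo_text"):
        anchor = h2.find("a")
        # find() returns None when the <h2> has no <a> child.
        if anchor is None or not anchor.get("href"):
            continue
        pairs.append((h2.get_text(strip=True), anchor["href"]))
    return pairs

# Made-up sample page: the second card has no link and is skipped.
sample_page = """
<div class="poap serp-container lawyer">
  <h2 class="indigo_text"><a href="https://example.com/mark">Mark</a></h2>
</div>
<div class="poap serp-container lawyer">
  <h2 class="indigo_text">No Link Here</h2>
</div>
"""

lawyers = extract_lawyers(sample_page)
print(lawyers)
```

Calling a helper like this once per results page, and visiting only the links it returns for that page, would also avoid re-requesting the profiles collected from earlier pages, which the question's `for link in links` loop does.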

python

beautifulsoup

python-requests