Selenium web spider fails to scrape two table <td> tags in a row using Beautiful Soup
I'm trying to use Python, Selenium, and Beautiful Soup to scrape grades and class sizes for multiple schools and years from an ASP site. My end goal is to load that data into a pandas DataFrame for CSV export. At this point in my script,
cells = rows.find_all('td')
I get this error:
ResultSet object has no attribute '%s'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?" % key
AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
I'm not quite sure what I'm doing wrong, since the error message seems to suggest two opposite fixes for the same problem. Here's the script I'm running; the print() statements seem to show everything else works, and other answers on Stack Overflow haven't shed any light. Any help would be appreciated.
all_data = []
s = open("toronto_school_ids.txt", "r")
m = s.read().splitlines()
for i in range(0, len(m)):
    school_id = m[i]
    print("Beginning the search for all schools in the Toronto District School Board...")
    # ontario class size tracker website
    driver.get("https://www.app.edu.gov.on.ca/eng/cst/classSize2.asp?sch_no=" + school_id)
    print("Got the Ontario website...")
    s = open("years.txt", "r")
    m = s.read().splitlines()
    for i in range(0, len(m)):
        year = m[i]
        # selenium takes over
        dropdown = Select(driver.find_element_by_name("schYR"))
        dropdown.select_by_value(year)
        print("Got the year we wanted to search for...")
        # Now we can grab the search button and click it
        search_button = driver.find_element_by_id("frmYearsSubmit")
        search_button.click()
        print("Searching for said year...")
        time.sleep(5)
        # We can feed that into Beautiful Soup
        soup = BeautifulSoup(driver.page_source, "html.parser")
        print("The name of the school we are searching for is...")
        school_name = soup.find_all('h2')[0].get_text()
        table = soup.find('tbody')
        rows = table.find_all('tr')
        all_data = []
        cells = rows.find_all('td')
        print("Now to get grades and class sizes...")
        for cell in cells:
            grade = cell.find('td', {"style": "border:1px solid #000000; padding-left:3px"}).get_text(strip=True)
            students = cell.find('td', {"style": "border:1px solid #000000;"}).get_text(strip=True)
            all_data.append({'School ID number': school_id, 'School': school_name, 'Year': year, 'Grade': grade, 'Classroom size': students})
            print(grade)
            print(students)
Close your files once you've read the data out of them. You'll also want to navigate back to the URL after finishing each pass of the years loop. And you can write the table straight to a CSV right away by looping over the tr and td values and writing each row to the file.
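As for the AttributeError itself: `find_all('tr')` returns a ResultSet, which behaves like a list of Tags, so you can't call `find_all()` on it again; you have to call it on each row Tag inside a loop. A minimal Beautiful Soup-only sketch of that fix, using a small hypothetical HTML snippet in place of the live page:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the class-size table on the real page.
html = """
<table><tbody>
  <tr><td>Grade 1</td><td>20</td></tr>
  <tr><td>Grade 2</td><td>23</td></tr>
</tbody></table>
"""
soup = BeautifulSoup(html, "html.parser")

rows = soup.find("tbody").find_all("tr")  # rows is a ResultSet (a list of Tags)
# rows.find_all("td")  # AttributeError: ResultSet object has no attribute 'find_all'

# Call find_all() on each individual row Tag instead:
data = []
for row in rows:
    cells = row.find_all("td")
    data.append([c.get_text(strip=True) for c in cells])

print(data)  # [['Grade 1', '20'], ['Grade 2', '23']]
```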
import csv

idsFile = open("toronto_school_ids.txt", "r")
ids = idsFile.read().splitlines()
idsFile.close()

yearsFile = open("years.txt", "r")
years = yearsFile.read().splitlines()
yearsFile.close()

print("Beginning the search for all schools in the Toronto District School Board...")
# ontario class size tracker website
with open('data.csv', 'w', newline='') as csvfile:
    for i in range(0, len(ids)):
        school_id = ids[i]
        driver.get("https://www.app.edu.gov.on.ca/eng/cst/classSize2.asp?sch_no=" + school_id)
        name = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "#right_column > h2"))).text
        print(name)
        for j in range(0, len(years)):
            dropdown = Select(driver.find_element_by_name("schYR"))
            dropdown.select_by_value(years[j])
            print("Got the year we wanted to search for...")
            search_button = driver.find_element_by_id("frmYearsSubmit")
            search_button.click()
            print("Searching for said year...")
            table = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "td > table")))
            wr = csv.writer(csvfile)
            for row in table.find_elements_by_css_selector('tr'):
                wr.writerow([d.text for d in row.find_elements_by_css_selector('td')])
            time.sleep(5)
            driver.back()
# The with block closes data.csv automatically; no explicit close() is needed.
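Since the stated end goal was a pandas DataFrame, the data.csv written above can be loaded afterwards. A small sketch, assuming two-column rows and supplying hypothetical column names (the scraped table rows carry no header of their own):

```python
import io

import pandas as pd

# Hypothetical CSV content shaped like the data.csv the loop writes.
csv_text = "Grade 1,20\nGrade 2,23\n"

# No header row in the file, so provide column names explicitly.
df = pd.read_csv(io.StringIO(csv_text), names=["Grade", "Classroom size"])
print(df)
```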
Output