Scraping data from a dynamic table containing multiple drop-down options using Selenium Python

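I'm new to web scraping and am currently trying to scrape information about all water utilities from this site, which offers a drop-down of different regions, and write the results to csv files.

The site's URL does not change; it stays the same every time a drop-down option is selected. So far my code (adapted from an existing example) is able to select the first region in the options, but it doesn't seem to get any further. This is what I have so far: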

from bs4 import BeautifulSoup
import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.support.ui import Select


url = 'https://database.ib-net.org/search_utilities?type=2'
browser = webdriver.Chrome()
browser.get(url)
time.sleep(4)
print("Retrieving the site...")

# All regions available
regions = ['Africa', 'East Asia and Pacific', 'Europe and Central Asia', 'Latin America (including USA and Canada', 'Middle East and Northern Africa', 'South Asia']

for region in regions:
   print("Starting output for the region: " + region)

   # Select all options from drop down menu
   selectOption = Select(browser.find_element_by_id('MainContent_ddRegion'))

   print("Now constructing output for: " + region)

   # Select table and wait for data to populate
   selectOption.select_by_visible_text(region)

   time.sleep(4)

   # Select the table containing the data and select all rows
   table = browser.find_element_by_xpath("//*[@id='MainContent_gvUtilities']")
   print(table)
   table_rows = table.find_elements_by_xpath(".//tr")

   # Create a list for each column in the table with each column number
   utility_name = [] #0
   country = [] #2
   city = []    #3
   population = [] #4

   for row in table_rows:
      column_element = row.find_elements_by_xpath(".//td")
      utility_name.append(column_element[0])
      country.append(column_element[2])
      city.append(column_element[3])
      population.append(column_element[4])

   #Create a dictionary of all utilities for each region
   dict_output = {
       "Utility Name": utility_name,
       "Country": country,
       "City": city, 
       "Population": population,
   }

   df = pd.DataFrame.from_dict(dict_output)
   df.to_csv(region, index = False)


browser.close()
browser.quit()

I get this error every time:

  File "/home/ken/.local/lib/python3.8/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: element is not attached to the page document
  (Session info: chrome=91.0.4472.77)
  (Driver info: chromedriver=2.26.436382 (70eb799287ce4c2208441fc057053a5b07ceabac),platform=Linux 5.8.0-59-generic x86_64)

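I'm stuck here and can't figure out what I'm doing wrong, or what I should actually be doing to fix this error. Any help or pointers would be greatly appreciated!

Thanks!!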

I can't seem to reproduce your error. However, after running it, I have a couple of points:

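  1. There is a typo in your regions list: 'Latin America (including USA and Canada' should be 'Latin America (including USA and Canada)'.
  2. Have you considered using pandas to parse the table? Its read_html function uses an HTML parser such as lxml or BeautifulSoup under the hood and does most of the work for you.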
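
The typo in point 1 matters because `select_by_visible_text` needs the option text to match exactly. As a minimal sketch of how such mismatches could be caught before the loop starts: the helper name `check_regions` is hypothetical, and the `available` list is hard-coded here as an assumption. In a live session it could instead be built from the drop-down itself, for example `[o.text for o in selectOption.options]`.

```python
import difflib


def check_regions(requested, available):
    """Map each requested name that is not in `available` to its closest match (or None)."""
    problems = {}
    for name in requested:
        if name not in available:
            # get_close_matches returns a (possibly empty) list of similar strings
            matches = difflib.get_close_matches(name, available, n=1)
            problems[name] = matches[0] if matches else None
    return problems


# Assumed option texts, taken from the corrected regions list in this answer
available = [
    'Africa',
    'East Asia and Pacific',
    'Europe and Central Asia',
    'Latin America (including USA and Canada)',
    'Middle East and Northern Africa',
    'South Asia',
]

# The list from the question, including the entry with the missing parenthesis
requested = ['Africa', 'Latin America (including USA and Canada']

for bad, suggestion in check_regions(requested, available).items():
    print("No such option: " + repr(bad) + "; did you mean " + repr(suggestion) + "?")
```

An empty result means every name can be passed to `select_by_visible_text` as written.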

Code:

import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.support.ui import Select


url = 'https://database.ib-net.org/search_utilities?type=2'
browser = webdriver.Chrome()
browser.get(url)
time.sleep(4)
print("Retrieving the site...")

# All regions available
regions = ['Africa', 'East Asia and Pacific', 'Europe and Central Asia', 'Latin America (including USA and Canada)', 'Middle East and Northern Africa', 'South Asia']

for region in regions:
   print("Starting output for the region: " + region)

   # Locate the region drop-down
   selectOption = Select(browser.find_element_by_id('MainContent_ddRegion'))

   print("Now constructing output for: " + region)

   # Choose the region, then give the table time to populate
   selectOption.select_by_visible_text(region)

   time.sleep(4)

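   # Parse the rendered page with pandas: take the first table,
   # drop its last row, then drop any column that contains NaN
   table = pd.read_html(browser.page_source)[0][:-1].dropna(axis=1)
   print(table)

   # Write this region's table to a file named after the region
   table.to_csv(region, index = False)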


browser.close()
browser.quit()
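
One optional refinement: the scripts above pass the raw region name as the output path, so the files end up without a `.csv` extension and with spaces and parentheses in their names. A small sketch of deriving a tidier file name follows; `region_to_filename` is a hypothetical helper, not part of pandas or Selenium, and the replacement rule is just one reasonable choice.

```python
import re


def region_to_filename(region, extension=".csv"):
    """Turn a region label into a simple file name such as 'South_Asia.csv'."""
    # Collapse every run of characters other than letters, digits, '-' or '_'
    # into a single underscore, then trim underscores from both ends
    stem = re.sub(r"[^A-Za-z0-9_-]+", "_", region).strip("_")
    return stem + extension


print(region_to_filename('Latin America (including USA and Canada)'))
```

Inside the loop this could be used as `table.to_csv(region_to_filename(region), index = False)`.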