Scraping data from a dynamic table containing multiple drop-down options using Selenium Python

Question

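I'm new to web scraping and am currently trying to scrape information about all water utilities from this site, which has a drop-down option for each region, and write the output to a csv file.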

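The site's URL doesn't change; it stays the same each time a drop-down option is selected. So far, my code (inspired by this) is able to select the first region in the options, but it doesn't seem to get any further. Here's what I have so far: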

from bs4 import BeautifulSoup
import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.support.ui import Select


url = 'https://database.ib-net.org/search_utilities?type=2'
browser = webdriver.Chrome()
browser.get(url)
time.sleep(4)
print("Retriving the site...")

# All regions available
regions = ['Africa', 'East Asia and Pacific', 'Europe and Central Asia', 'Latin America (including USA and Canada', 'Middle East and Northern Africa', 'South Asia']

for region in regions:
   print("Starting output for the region: " + region)

   # Select all options from drop down menu
   selectOption = Select(browser.find_element_by_id('MainContent_ddRegion'))

   print("Now constructing output for: " + region)

   # Select table and wait for data to populate
   selectOption.select_by_visible_text(region)

   time.sleep(4)

   # Select the table containing the data and select all rows
   table = browser.find_element_by_xpath("//*[@id='MainContent_gvUtilities']")
   print(table)
   table_rows = table.find_elements_by_xpath(".//tr")

   # Create a list for each column in the table with each column number
   utility_name = [] #0
   country = [] #2
   city = []    #3
   population = [] #4

   for row in table_rows:
      column_element = row.find_elements_by_xpath(".//td")
      utility_name.append(column_element[0])
      country.append(column_element[2])
      city.append(column_element[3])
      population.append(column_element[4])

   #Create a dictionary of all utilities for each region
   dict_output = {
       "Utility Name": utility_name,
       "Country": country,
       "City": city, 
       "Population": population,
   }

   df = pd.DataFrame.from_dict(dict_output)
   df.to_csv(region, index = False)


browser.close()
browser.quit()

I get this error every time:

  File "/home/ken/.local/lib/python3.8/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: element is not attached to the page document
  (Session info: chrome=91.0.4472.77)
  (Driver info: chromedriver=2.26.436382 (70eb799287ce4c2208441fc057053a5b07ceabac),platform=Linux 5.8.0-59-generic x86_64)

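I'm stuck here and can't figure out what I'm doing wrong, or what I should actually be doing to fix this error. Any help or pointers would be greatly appreciated!

Thanks!!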

Answer 1

I can't seem to reproduce your error, but I did run the code and noticed the following points:

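There's a typo in your regions list: 'Latin America (including USA and Canada' should be 'Latin America (including USA and Canada)'
Have you considered using pandas to parse the table? It uses an HTML parser such as lxml or BeautifulSoup under the hood and does most of the work for you.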

Code:

import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.support.ui import Select


url = 'https://database.ib-net.org/search_utilities?type=2'
browser = webdriver.Chrome()
browser.get(url)
time.sleep(4)
print("Retrieving the site...")

# All regions available
regions = ['Africa', 'East Asia and Pacific', 'Europe and Central Asia', 'Latin America (including USA and Canada)', 'Middle East and Northern Africa', 'South Asia']

for region in regions:
   print("Starting output for the region: " + region)

   # Locate the region drop-down menu (re-located on every iteration)
   selectOption = Select(browser.find_element_by_id('MainContent_ddRegion'))

   print("Now constructing output for: " + region)

   # Select the region, then wait for the table to repopulate
   selectOption.select_by_visible_text(region)

   time.sleep(4)

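   # Parse the first table on the page with pandas, dropping the
   # last row and any columns that contain NaN
   table = pd.read_html(browser.page_source)[0][:-1].dropna(axis=1)
   print(table)

   # Write this region's table to a csv file named after the region
   table.to_csv(region, index = False)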


browser.close()
browser.quit()
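
As for the StaleElementReferenceException itself: it generally means a reference to an element was kept while the page re-rendered, which is plausible here since the table is rebuilt after each drop-down selection. One possible way to cope, as a rough sketch rather than a definitive fix, is to re-locate and re-read the element whenever that happens. The snippet below shows the retry pattern on its own so it runs without a browser. The names `retry_on`, `FakeStaleError` and `read_table_text` are hypothetical, and `FakeStaleError` only stands in for `selenium.common.exceptions.StaleElementReferenceException`.

```python
import time


class FakeStaleError(Exception):
    """Hypothetical stand-in for selenium's StaleElementReferenceException."""


def retry_on(exc_type, action, attempts=3, delay=0.0):
    """Call action() until it succeeds or the attempts run out.

    Re-raises the exc_type error from the final attempt if all of them fail.
    """
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except exc_type:
            if attempt == attempts:
                raise
            time.sleep(delay)  # give the page time to finish re-rendering


# Demo: an action that fails twice with a "stale" error, then succeeds.
calls = {"count": 0}


def read_table_text():
    calls["count"] += 1
    if calls["count"] < 3:
        raise FakeStaleError("element is not attached to the page document")
    return "Utility A | Country X"


result = retry_on(FakeStaleError, read_table_text, attempts=5)
print(result)
```

In real use, the callable passed as `action` would look the table up again inside its own body (for example, find the element and read its text or the page source), so that every retry works with a fresh reference instead of the stale one, and `exc_type` would be the real Selenium exception class.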

python

selenium

web-scraping

drop-down-menu