Scraping data from a dynamic table containing multiple drop-down options using Selenium Python
I'm new to web scraping and am currently trying to scrape information about all of the water utilities from this site, which has a drop-down of different regions, and write the results out to a CSV file.
The site's URL doesn't change; it stays the same every time a drop-down option is selected. My code so far (inspired by this) is able to select the first region from the options, but it doesn't seem to get any further. This is what I have so far:
from bs4 import BeautifulSoup
import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.support.ui import Select

url = 'https://database.ib-net.org/search_utilities?type=2'
browser = webdriver.Chrome()
browser.get(url)
time.sleep(4)
print("Retriving the site...")

# All regions available
regions = ['Africa', 'East Asia and Pacific', 'Europe and Central Asia', 'Latin America (including USA and Canada', 'Middle East and Northern Africa', 'South Asia']

for region in regions:
    print("Starting output for the region: " + region)
    # Select all options from drop down menu
    selectOption = Select(browser.find_element_by_id('MainContent_ddRegion'))
    print("Now constructing output for: " + region)
    # Select table and wait for data to populate
    selectOption.select_by_visible_text(region)
    time.sleep(4)
    # Select the table containing the data and select all rows
    table = browser.find_element_by_xpath("//*[@id='MainContent_gvUtilities']")
    print(table)
    table_rows = table.find_elements_by_xpath(".//tr")
    # Create a list for each column in the table with each column number
    utility_name = [] #0
    country = [] #2
    city = [] #3
    population = [] #4
    for row in table_rows:
        column_element = row.find_elements_by_xpath(".//td")
        utility_name.append(column_element[0])
        country.append(column_element[2])
        city.append(column_element[3])
        population.append(column_element[4])
    # Create a dictionary of all utilities for each region
    dict_output = {
        "Utility Name": utility_name,
        "Country": country,
        "City": city,
        "Population": population,
    }
    df = pd.DataFrame.from_dict(dict_output)
    df.to_csv(region, index=False)

browser.close()
browser.quit()
Every time I run it I get this error:
File "/home/ken/.local/lib/python3.8/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: element is not attached to the page document
(Session info: chrome=91.0.4472.77)
(Driver info: chromedriver=2.26.436382 (70eb799287ce4c2208441fc057053a5b07ceabac),platform=Linux 5.8.0-59-generic x86_64)
I'm stuck here and can't seem to figure out what I'm doing wrong, or what I should actually be doing to resolve this error. Any help or pointers would be greatly appreciated!
Thank you!!
I can't seem to reproduce your error, but having run your code, a couple of points:
- There's a typo in your regions list: 'Latin America (including USA and Canada' should be 'Latin America (including USA and Canada)'
- Have you considered letting pandas parse the table? It uses BeautifulSoup under the hood and does most of the work for you. (A short illustration of what read_html returns follows this list.)
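To get a feel for what pandas.read_html returns, here is a minimal, self-contained sketch; the HTML snippet is invented purely for illustration and only mimics the structure of the site's utilities grid:

# Minimal illustration of pandas.read_html; the HTML below is made up and
# only mimics the structure of the MainContent_gvUtilities grid.
from io import StringIO
import pandas as pd

html = """
<table id="MainContent_gvUtilities">
  <tr><th>Utility Name</th><th>Country</th><th>City</th><th>Population</th></tr>
  <tr><td>Example Utility</td><td>Kenya</td><td>Nairobi</td><td>4400000</td></tr>
</table>
"""

# read_html returns one DataFrame per <table> element it finds in the markup
tables = pd.read_html(StringIO(html))
print(tables[0])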
Code:
import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.support.ui import Select

url = 'https://database.ib-net.org/search_utilities?type=2'
browser = webdriver.Chrome()
browser.get(url)
time.sleep(4)
print("Retrieving the site...")

# All regions available
regions = ['Africa', 'East Asia and Pacific', 'Europe and Central Asia', 'Latin America (including USA and Canada)', 'Middle East and Northern Africa', 'South Asia']

for region in regions:
    print("Starting output for the region: " + region)
    # Select the region from the drop down menu
    selectOption = Select(browser.find_element_by_id('MainContent_ddRegion'))
    print("Now constructing output for: " + region)
    # Select the region and wait for the data to populate
    selectOption.select_by_visible_text(region)
    time.sleep(4)
    # Let pandas parse the table from the page source: keep all but the
    # last row and drop columns that contain NaN values
    table = pd.read_html(browser.page_source)[0][:-1].dropna(axis=1)
    print(table)
    table.to_csv(region, index=False)

browser.close()
browser.quit()
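As for the original StaleElementReferenceException: it is typically raised when the page re-renders (here, presumably the postback triggered by selecting a region) after an element has been located, so the previously found WebElement is no longer attached to the DOM. Re-locating the drop-down and the table on every iteration, and replacing the fixed time.sleep calls with explicit waits, is one way around it. The sketch below is only a rough outline along those lines, using Selenium's WebDriverWait and expected_conditions with the same element ids as above; it also reads cell text via .text, since appending raw WebElements (as in the original loop) would put element objects rather than strings into the DataFrame.

# Rough sketch: avoid stale element references by re-locating elements on
# every iteration and waiting for the old grid to detach after each
# selection, instead of using fixed sleeps. Element ids are the ones above.
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select, WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome()
browser.get('https://database.ib-net.org/search_utilities?type=2')
wait = WebDriverWait(browser, 20)

regions = ['Africa', 'East Asia and Pacific', 'Europe and Central Asia',
           'Latin America (including USA and Canada)',
           'Middle East and Northern Africa', 'South Asia']

for region in regions:
    # Re-locate the drop-down each time round the loop
    dropdown = wait.until(EC.presence_of_element_located((By.ID, 'MainContent_ddRegion')))
    # Remember the current grid (if any) so we can wait for it to go stale
    old_grid = browser.find_elements(By.ID, 'MainContent_gvUtilities')
    Select(dropdown).select_by_visible_text(region)
    if old_grid:
        # The postback replaces the grid, detaching the old element
        wait.until(EC.staleness_of(old_grid[0]))
    table = wait.until(EC.presence_of_element_located((By.ID, 'MainContent_gvUtilities')))

    rows = []
    for row in table.find_elements(By.XPATH, ".//tr"):
        cells = row.find_elements(By.XPATH, ".//td")
        if len(cells) >= 5:
            rows.append({"Utility Name": cells[0].text,
                         "Country": cells[2].text,
                         "City": cells[3].text,
                         "Population": cells[4].text})
    # One CSV per region (a .csv extension is added here for convenience)
    pd.DataFrame(rows).to_csv(region + '.csv', index=False)

browser.quit()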