Selenium not going to next page in scraper
I'm writing my first real scraper and, although it's going well overall, I've hit a wall with Selenium: I can't get it to go to the next page.
Below is the head of my code. The output just prints the data to the terminal for now, and that all works fine. The scrape simply stops at the end of page 1 and drops me back at my terminal prompt; it never starts on page 2. I'd be grateful for any suggestions. I've tried selecting the button at the bottom of the page with both the relative and the full XPath (you're seeing the full one here), but neither works. I'm trying to click the right-arrow button.
I built in my own error message to indicate whether the driver successfully found the element via the XPath. The error message fires when I run the code, so I'm guessing it isn't finding the element. I just don't understand why not.
# Importing libraries
import requests
import csv
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup
# Import selenium
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, WebDriverException
import time
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
options.add_argument('--headless')
driver = webdriver.Chrome("/path/to/driver", options=options)
# Yes, I do have the actual path to my driver in the original code
driver.get("https://uk.eu-supply.com/ctm/supplier/publictenders?B=UK")
time.sleep(5)
while True:
    try:
        driver.find_element_by_xpath('/html/body/div[1]/div[3]/div/div/form/div[3]/div/div/ul[1]/li[4]/a').click()
    except (TimeoutException, WebDriverException) as e:
        print("A timeout or webdriver exception occurred.")
        break
driver.quit()
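One quick way to check whether that XPath matches anything at all (a diagnostic sketch, not part of the original post): find_elements() returns an empty list instead of raising, so you can print the match count before trying to click.
# Diagnostic sketch: find_elements() returns [] when nothing matches, so no exception is raised.
matches = driver.find_elements(By.XPATH, '/html/body/div[1]/div[3]/div/div/form/div[3]/div/div/ul[1]/li[4]/a')
print(f"Elements matching the full XPath: {len(matches)}")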
You can set up Selenium expected conditions (visibility_of_element_located, element_to_be_clickable) and use a relative XPath to select the next-page element, all inside a loop whose range is the number of pages you have to handle.
XPath of the next-page link:
//div[@class='pagination ctm-pagination']/ul[1]/li[last()-1]/a
The code could look like this:
## imports
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.get("https://uk.eu-supply.com/ctm/supplier/publictenders?B=UK")

## read the total number of pages from the pager's last entry
els = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[@class='pagination ctm-pagination']/ul[1]/li[last()]/a"))).get_attribute("data-current-page")

## loop over the pages; at the end of each iteration, click through to the next one
for i in range(int(els)):
    # scrape what you want here
    if i < int(els) - 1:  # there is no next page to click after the last one
        WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//div[@class='pagination ctm-pagination']/ul[1]/li[last()-1]/a"))).click()
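As a minimal sketch of what the scraping step could look like, assuming the results are rendered as table rows (the selector below is an assumption, not taken from the real page, so adjust it to the actual markup):
# Hypothetical scrape step: the CSS selector is an assumption about the page's markup.
rows = driver.find_elements(By.CSS_SELECTOR, "table tr")
for row in rows:
    print(row.text)  # just print each row to the terminal for now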
Your while True and try/except logic was very close. With Selenium and Python you have to induce WebDriverWait for element_to_be_clickable(), and you can use the following to go to the next page:
Code block:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

driver.get("https://uk.eu-supply.com/ctm/supplier/publictenders?B=UK")
while True:
    try:
        # wait until the link right after the active page number is clickable, then click it
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[contains(@class, 'state-active')]//following::li[1]/a[@href]"))).click()
        print("Clicked for next page")
        # re-locate the next-page link and wait for it to go stale as the page re-renders
        WebDriverWait(driver, 10).until(EC.staleness_of(driver.find_element_by_xpath("//a[contains(@class, 'state-active')]//following::li[1]/a[@href]")))
    except TimeoutException:
        print("No more pages")
        break
driver.quit()
Console output:
Clicked for next page
No more pages
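A side note: both snippets use the old find_element_by_xpath helper, which was removed in Selenium 4. A sketch of the equivalent setup under Selenium 4 (assuming a 4.x install; the driver path is the same placeholder as in the question) would be:
# Selenium 4 sketch: Service object for the driver path, find_element(By.XPATH, ...) for lookups.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
service = Service("/path/to/driver")  # placeholder path, as in the question
driver = webdriver.Chrome(service=service, options=options)
driver.get("https://uk.eu-supply.com/ctm/supplier/publictenders?B=UK")
next_link = driver.find_element(By.XPATH, "//a[contains(@class, 'state-active')]//following::li[1]/a[@href]")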