Selenium not going to next page in scraper

I'm writing my first real scraper and, although it's going well overall, I've hit a wall with Selenium: I can't get it to move on to the next page.

Below is the head of my code. For now the output just prints the data in the terminal, and that all works fine. The scraper simply stops at the end of page 1 and drops me back at my terminal prompt; it never starts on page 2. I'd be grateful for any suggestions. I've tried selecting the button at the bottom of the page with both a relative and a full XPath (you see the full one here), but neither works. I'm trying to click the right-arrow button.

I built in my own error message to indicate whether the driver actually found the element via the XPath. The error message fires when I run the code, so I assume it isn't finding the element. I just don't understand why not.

# Importing libraries
import requests
import csv
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Import selenium 
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, WebDriverException
import time

options = Options()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
options.add_argument('--headless')
driver = webdriver.Chrome(service=Service("/path/to/driver"), options=options)
# Yes, I do have the actual path to my driver in the original code

driver.get("https://uk.eu-supply.com/ctm/supplier/publictenders?B=UK")
time.sleep(5)
while True:
    try:
        # (scraping of the current page elided here)
        # Selenium 4 syntax; find_element_by_xpath() has been removed
        driver.find_element(By.XPATH, '/html/body/div[1]/div[3]/div/div/form/div[3]/div/div/ul[1]/li[4]/a').click()
    except (TimeoutException, WebDriverException):
        print("A timeout or webdriver exception occurred.")
        break
driver.quit()

You can use Selenium expected conditions (visibility_of_element_located, element_to_be_clickable) together with a relative XPath to select the next-page element, all inside a loop whose range is the number of pages you have to handle.

XPath of the next-page link:

//div[@class='pagination ctm-pagination']/ul[1]/li[last()-1]/a

The code could look like this:

## imports

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.get("https://uk.eu-supply.com/ctm/supplier/publictenders?B=UK")

## count the number of pages: the last <li> in the pagination carries it
## in its data-current-page attribute

els = WebDriverWait(driver, 20).until(
    EC.visibility_of_element_located((By.XPATH, "//div[@class='pagination ctm-pagination']/ul[1]/li[last()]/a"))
).get_attribute("data-current-page")

## loop: scrape the page, then click through to the following one

for i in range(int(els)):
    # scrape what you want here
    if i < int(els) - 1:  # the last page has no next-page link to click
        WebDriverWait(driver, 20).until(
            EC.element_to_be_clickable((By.XPATH, "//div[@class='pagination ctm-pagination']/ul[1]/li[last()-1]/a"))
        ).click()
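As a sketch of the "scrape what you want here" step, you could pull the visible rows with Selenium itself. This is only a hypothetical example: the CSS selector "table tbody tr" is assumed, not taken from the real page, so replace it with the site's actual tender-row locator.

# hypothetical selector; replace with the site's actual tender-row locator
for row in driver.find_elements(By.CSS_SELECTOR, "table tbody tr"):
    print(row.text)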

Your while True and try-except logic was very close. You have to induce WebDriverWait for element_to_be_clickable(), and you can use the following locator strategy to go to the next page:

  • Code block:

    driver.get("https://uk.eu-supply.com/ctm/supplier/publictenders?B=UK")
    while True:
        try:
            WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[contains(@class, 'state-active')]//following::li[1]/a[@href]"))).click()
            print("Clicked for next page")
            WebDriverWait(driver, 10).until(EC.staleness_of(driver.find_element_by_xpath("//a[contains(@class, 'state-active')]//following::li[1]/a[@href]")))
        except (TimeoutException):
            print("No more pages")
            break
    driver.quit()
    
  • Console output:

    Clicked for next page
    No more pages
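
Putting it together, here is a minimal sketch of how the scraping step could slot into that loop, parsing each page with the BeautifulSoup import from the question before clicking onward. The "table tr" selector is an assumption for illustration, not taken from the live page:

from bs4 import BeautifulSoup

driver.get("https://uk.eu-supply.com/ctm/supplier/publictenders?B=UK")
while True:
    # parse whatever is on the current page before moving on
    soup = BeautifulSoup(driver.page_source, "html.parser")
    for row in soup.select("table tr"):  # assumed selector for the tender rows
        print(row.get_text(strip=True))
    try:
        next_page = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[contains(@class, 'state-active')]//following::li[1]/a[@href]")))
        next_page.click()
        WebDriverWait(driver, 10).until(EC.staleness_of(next_page))
    except TimeoutException:
        break
driver.quit()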