如何使用 Selenium Webdriver 从多个页面获取信息？

Question

我目前正在尝试从 Bonhams 网站 (https://www.bonhams.com/auctions/25281/?category=results#/!) 上提供的 'Hong Kong Watches 2.0' 拍卖的所有拍品（第 1 页到第 33 页）获取标题。我刚开始使用 python 和 selenium，但我尝试使用下面的代码获取结果。这段代码给了我想要的结果，但仅限于第 1 页。然后，代码不断重复第 1 页的结果。看起来点击下一页的循环不起作用。谁能帮我解决这个循环？

下面你可以找到我使用的代码：

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions

driver=webdriver.Chrome()
driver.get('https://www.bonhams.com/auctions/25281/?category=results#/!')

while True:
    next_page_btn =driver.find_elements_by_xpath("//*[@id='lots']/div[2]/div[5]/div/a[10]/div")
    if len(next_page_btn) <1:
        print("no more pages left")
        break
    else:
        titles = driver.find_elements_by_xpath("//*[@class='firstLine']")
        titles = [title.text for title in titles]
        print(titles)

    element = WebDriverWait(driver,5).until(expected_conditions.element_to_be_clickable((By.ID,'lots')))
    driver.execute_script("return arguments[0].scrollIntoView();", element)
    element.click()

下面是我得到的输出。 Python 在此输出中一遍又一遍地保留 repeating/loading（我认为它这样做了 33 次？？？）。

['Hong Kong Watches 2.0', '', 'OMEGA. A Very Fine And Rare Limited Edition 
Yellow Gold Chronograph Bracelet Watch, Commemorating the Apollo 11 Space 
Mission And The Successful Moon Landing in 1969', '', '', '', 'ROLEX. TWO 
SETS OF SHOWCASE DISPLAYS, MADE FOR ROLEX RETAILERS IN 1970s', '', 'ROLEX. 
TWO SETS OF RARE SHOWCASE DISPLAYS, MADE FOR ROLEX RETAILERS IN 1980s', 
'', 'PATEK PHILIPPE. A SET OF THREE RARE LIMOGES PORCELAIN AND ENAMEL 
DISHES', '', 'Bvlgari/MAUBOUSSIN. TWO SETS OF CUFFLINKS', '', 
'BOUCHERON/MONTBLANC. TWO SETS OF CUFFLINKS', '', 'PATEK PHILIPPE. TWO 
SETS OF CUFFLINKS', '', 'Jaeger-LeCoultre. A Gilt Brass Table Clock With 
8-Days Power Reserve and Alarm', '', 'Cartier & LeCoultre. A group of 
three gilt brass table clocks (Alarm/Alarm Worldtime/Engraved dial)', '', 
'Jaeger-LeCoultre. A Gilt Brass Table Clock With 8-Days Power Reserve', 
'', 'Reuge. A Gold Plated Musical Automaton Open Face Pocket Watch with 
Alarm', '', 'Imhof. An Attractive Gilt Brass Table Clock With Polychrome 
Enamel Dial', '', 'Vacheron Constantin. A Large Polished Metal Perpetual 
Calendar Wall Clock']
['Hong Kong Watches 2.0', '', 'OMEGA. A Very Fine And Rare Limited Edition 
Yellow Gold Chronograph Bracelet Watch, Commemorating the Apollo 11 Space 
Mission And The Successful Moon Landing in 1969', '', '', '', 'ROLEX. TWO 
SETS OF SHOWCASE DISPLAYS, MADE FOR ROLEX RETAILERS IN 1970s', '', 'ROLEX. 
TWO SETS OF RARE SHOWCASE DISPLAYS, MADE FOR ROLEX RETAILERS IN 1980s', 
'', 'PATEK PHILIPPE. A SET OF THREE RARE LIMOGES PORCELAIN AND ENAMEL 
DISHES', '', 'Bvlgari/MAUBOUSSIN. TWO SETS OF CUFFLINKS', '', 
'BOUCHERON/MONTBLANC. TWO SETS OF CUFFLINKS', '', 'PATEK PHILIPPE. TWO 
SETS OF CUFFLINKS', '', 'Jaeger-LeCoultre. A Gilt Brass Table Clock With 
8-Days Power Reserve and Alarm', '', 'Cartier & LeCoultre. A group of 
three gilt brass table clocks (Alarm/Alarm Worldtime/Engraved dial)', '', 
'Jaeger-LeCoultre. A Gilt Brass Table Clock With 8-Days Power Reserve', 
'', 'Reuge. A Gold Plated Musical Automaton Open Face Pocket Watch with 
Alarm', '', 'Imhof. An Attractive Gilt Brass Table Clock With Polychrome 
Enamel Dial', '', 'Vacheron Constantin. A Large Polished Metal Perpetual 
Calendar Wall Clock']

Answer 1

不需要 selenium 库来抓取数据。您还可以使用 requests 和 BeautifulSoup 库获取所有页面数据。

import requests
from bs4 import BeautifulSoup

headers = {
       "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:67.0) Gecko/20100101 Firefox/67.0",
       "Accept": "application/json"
   }

page_num = 1
title_list = []

while True:
    url = 'https://www.bonhams.com/api/v1/lots/25281/?category=results&length=12&minimal=false&page={}'.format(page_num)
    print("===url===",url)
    response = requests.get(url,headers=headers).json()
    max_lot = response['max_lot']
    last_iSaleLotNo = 0
    titles = []
    for lot in response['lots']:
        last_iSaleLotNo = lot['lot_id_combined']
        title = BeautifulSoup(lot['styled_title'], 'lxml').find("div",{'class':'firstLine'}).text.strip()
        titles.append(title)

    title_list.append(titles)
    print("===titles===",titles)
    if int(max_lot) == int(last_iSaleLotNo):
        break

    page_num+=1

print(title_list)

First-page o/p:

['ROLEX. TWO SETS OF SHOWCASE DISPLAYS, MADE FOR ROLEX RETAILERS IN 1970s', 'ROLEX. TWO SETS OF RARE SHOWCASE DISPLAYS, MADE FOR ROLEX RETAILERS IN 1980s', 'PATEK PHILIPPE. A SET OF THREE RARE LIMOGES PORCELAIN AND ENAMEL DISHES', 'Bvlgari/MAUBOUSSIN. TWO SETS OF CUFFLINKS', 'BOUCHERON/MONTBLANC. TWO SETS OF CUFFLINKS', 'PATEK PHILIPPE. TWO SETS OF CUFFLINKS', 'Jaeger-LeCoultre. A Gilt Brass Table Clock With 8-Days Power Reserve and Alarm', 'Cartier & LeCoultre. A group of three gilt brass table clocks (Alarm/Alarm Worldtime/Engraved dial)', 'Jaeger-LeCoultre. A Gilt Brass Table Clock With 8-Days Power Reserve', 'Reuge. A Gold Plated Musical Automaton Open Face Pocket Watch with Alarm', 'Imhof. An Attractive Gilt Brass Table Clock With Polychrome Enamel Dial', 'Vacheron Constantin. A Large Polished Metal Perpetual Calendar Wall Clock']

打开浏览器网络选项卡并单击下一步按钮，您将看到 JSON 响应数据，例如

如何使用 Selenium Webdriver 从多个页面获取信息？

How do I grab information from multiple pages using Selenium Webdriver?

python

selenium

selenium-chromedriver

selenium-webdriver

webdriverwait