Unable to access the remaining elements by xpaths in a loop after accessing the first element- Webscraping Selenium Python

I am trying to scrape data from the sciencedirect website. I tried to automate the scraping process by building a list of xpaths and looping over them to access the journal issues one after another. When I run it, the loop fails to access the remaining elements after accessing the first issue. This process worked for me on another website, but not on this one.

I would also like to know whether there is a better way to access these elements than this approach.

#Importing libraries
 import requests
 import os
 import json
 from selenium import webdriver
 import pandas as pd
 from bs4 import BeautifulSoup
 import time
 from time import sleep

 from selenium.webdriver.common.by import By
 from selenium.webdriver.support.ui import WebDriverWait
 from selenium.webdriver.support import expected_conditions as EC

 #initializing the chrome webdriver
 driver=webdriver.Chrome(executable_path=r"C:/selenium/chromedriver.exe")

 #website to be accessed
 driver.get("https://www.sciencedirect.com/journal/journal-of-corporate-finance/issues")

 #generating the list of xpaths to be accessed one after the other
 issues=[]
 for i in range(0,20):
     for j in range(1,7):
         issues.append(f'//*[@id="0-accordion-panel-{i}"]/section/div[{j}]/a')

 #looping to access one issue after the other
 for i in issues:
     try:
         hat=driver.find_element_by_xpath(i)
         hat.click()
         sleep(4)
         driver.back()
     except Exception:
         print("no more issues", i)

To scrape data from the sciencedirect website https://www.sciencedirect.com/journal/journal-of-corporate-finance/issues you can perform the following steps:

  • First open all the accordions.

  • Then open each issue in an adjacent tab using Ctrl + click().

  • Next switch to the newly opened tab and scrape the required contents.

  • Code block:

      from selenium import webdriver
      from selenium.webdriver.common.by import By
      from selenium.webdriver.support.ui import WebDriverWait
      from selenium.webdriver.support import expected_conditions as EC
      from selenium.webdriver.common.action_chains import ActionChains
      from selenium.webdriver.common.keys import Keys
    
      options = webdriver.ChromeOptions() 
      options.add_argument("start-maximized")
      options.add_experimental_option("excludeSwitches", ["enable-automation"])
      options.add_experimental_option('useAutomationExtension', False)
      driver = webdriver.Chrome(options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
      driver.get('https://www.sciencedirect.com/journal/journal-of-corporate-finance/issues')
      accordions = WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "li.accordion-panel.js-accordion-panel>button.accordion-panel-title>span")))
      for accordion in accordions:
          ActionChains(driver).move_to_element(accordion).click(accordion).perform()
      issues = WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "a.anchor.js-issue-item-link.text-m span.anchor-text")))
      windows_before  = driver.current_window_handle
      for issue in issues:
          ActionChains(driver).key_down(Keys.CONTROL).click(issue).key_up(Keys.CONTROL).perform()
          WebDriverWait(driver, 10).until(EC.number_of_windows_to_be(2))
          windows_after = driver.window_handles
          new_window = [x for x in windows_after if x != windows_before][0]
          driver.switch_to.window(new_window)
          WebDriverWait(driver, 30).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "a#journal-title>span")))
          print(WebDriverWait(driver, 30).until(EC.visibility_of_element_located((By.XPATH, "//h2"))).get_attribute("innerHTML"))
          driver.close()
          driver.switch_to.window(windows_before)
      driver.quit()
    
  • Console output:

      Institutions, Governance and Finance in a Globally Connected Environment
      Volume 58
      Corporate Governance in Multinational Enterprises
      .
      .
      .
    
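As an alternative to Ctrl + click() and window switching, you could also collect every issue URL up front and then visit each page with driver.get(), which avoids the click-then-back pattern that reloads the page and invalidates the remaining element references. A minimal sketch, reusing the `a.anchor.js-issue-item-link` selector from the answer above; the `absolute_issue_urls` helper is a hypothetical name introduced here:

```python
from urllib.parse import urljoin

def absolute_issue_urls(hrefs, base="https://www.sciencedirect.com"):
    """Turn collected href attributes into absolute, de-duplicated URLs."""
    seen, urls = set(), []
    for href in hrefs:
        if not href:
            continue  # skip anchors with no href attribute
        url = urljoin(base, href)  # leaves already-absolute URLs unchanged
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls

# Usage with Selenium (not executed here; selector assumed from the answer above):
# links = WebDriverWait(driver, 30).until(EC.presence_of_all_elements_located(
#     (By.CSS_SELECTOR, "a.anchor.js-issue-item-link")))
# for url in absolute_issue_urls([a.get_attribute("href") for a in links]):
#     driver.get(url)  # direct navigation, so no back() and nothing goes stale
```

Because each page is loaded directly from its URL, there is no dependence on the accordion state after navigation, which is what breaks the click-and-back loop in the question.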

References

You can find a couple of relevant detailed discussions in:

  • StaleElementReferenceException even after adding the wait while collecting the data from the wikipedia using web-scraping