Extract the breadcrumbs of a website using Selenium

I need to extract the breadcrumbs of this website: https://www.woolworths.com.au/Shop/Browse/drinks/cordials-juices-iced-teas/iced-teas

I tried inspecting the element and copying its XPath, but that does not extract it:

from selenium import webdriver
driver = webdriver.Firefox()
driver.get('https://www.woolworths.com.au/Shop/Browse/drinks/cordials-juices-iced-teas/iced-teas')
driver.find_elements_by_xpath('//*[@id="center-panel"]/div/wow-tile-list-with-content/ng-transclude/wow-browse-tile-list/wow-tile-list/div/div[1]/div[1]/wow-breadcrumbs/div/ul/li[4]/span/span')

driver.find_element_by_css_selector('#center-panel > div > wow-tile-list-with-content > ng-transclude > wow-browse-tile-list > wow-tile-list > div > div.tileList > div.tileList-headerContainer > wow-breadcrumbs > div > ul > li:nth-child(4) > span > span')

How should I proceed?

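The page you are trying to scrape is written in Angular, which means that most DOM elements are loaded dynamically by JavaScript AJAX code and do not yet exist when the initial page load finishes (that is, when the driver.get function returns).

You should use an explicit wait (WebDriverWait with its until method) to locate these elements.

Here is a working example using the XPath you provided: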

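from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Firefox()
driver.get('https://www.woolworths.com.au/Shop/Browse/drinks/cordials-juices-iced-teas/iced-teas')
try:
    # Poll the DOM until the breadcrumb element exists, or give up after 1 second
    element = WebDriverWait(driver, 1).until(
        EC.presence_of_element_located((By.XPATH, '//*[@id="center-panel"]/div/wow-tile-list-with-content/ng-transclude/wow-browse-tile-list/wow-tile-list/div/div[1]/div[1]/wow-breadcrumbs/div/ul/li[4]/span/span'))
    )
    print(element.text)  # this outputs Iced Teas
except TimeoutException:
    print("Timeout")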
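
To make the idea of an explicit wait more concrete, here is a minimal, Selenium-free sketch of a polling loop. It is only an illustration of the general technique and not Selenium's actual implementation; the names wait_until, WaitTimeout and fake_lookup are hypothetical, and fake_lookup merely stands in for a DOM lookup that starts succeeding once the page's JavaScript has run.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never becomes truthy (hypothetical name)."""


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value, or raises WaitTimeout once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} s")
        time.sleep(poll_interval)


# Stand-in for a DOM that is only populated after a few polls.
attempts = {"count": 0}


def fake_lookup():
    attempts["count"] += 1
    # Pretend the breadcrumb only exists from the third poll onwards.
    return "Iced Teas" if attempts["count"] >= 3 else None


print(wait_until(fake_lookup, timeout=2.0, poll_interval=0.01))
```

The point of the pattern is that the caller describes what it is waiting for and lets the loop retry, instead of querying the DOM once right after the page load and finding nothing.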

To print the breadcrumb of the website https://www.woolworths.com.au/Shop/Browse/drinks/cordials-juices-iced-teas/iced-teas you have to induce WebDriverWait with visibility_of_element_located() for the desired element, and you can use either of the following:

  • Using CSS_SELECTOR and the get_attribute() method:

    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "ul.breadcrumbs-linkList li:nth-child(4) span span"))).get_attribute("innerHTML"))
    
  • Using XPATH and the text attribute:

    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//ul[@class='breadcrumbs-linkList']//following-sibling::li[4]//span//span"))).text)
    
  • Note: You have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    

Outro

As per the documentation:

  • The get_attribute() method gets the given attribute or property of the element.
  • The text attribute returns the text of the element.
  • Difference between text and innerHTML using Selenium
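
As a rough illustration of the difference between the two options, the sketch below uses only Python's standard html.parser and no browser. The markup is hypothetical (the real breadcrumb span may contain no nested tags at all), and strip_tags is a made-up helper that only approximates what the text attribute returns, since Selenium also takes CSS visibility into account, which this sketch does not model.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects only the character data, discarding every tag."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(inner_html):
    """Return the text content of an HTML fragment (hypothetical helper)."""
    parser = TextOnly()
    parser.feed(inner_html)
    parser.close()
    return "".join(parser.parts).strip()


# Hypothetical markup, roughly what get_attribute("innerHTML") could return.
inner_html = "<span>Iced <b>Teas</b></span>"

print(inner_html)              # the raw markup, tags included
print(strip_tags(inner_html))  # only the text, closer to what .text yields
```

For a leaf element that contains nothing but text, both approaches give the same string, which is why either option works for the breadcrumb here.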

The following XPath worked for my verification:

//*[span='first text' and span='Search results for "second text"']