How to scrape images from slider/slideshow?

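There is this e-commerce page https://www.jooraccess.com/r/products?token=feba69103f6c9789270a1412954cf250 with hundreds of products, and each product has a slider (or slideshow, or whatever you call it) with images. I just need to scrape all the images from the page. I know how to scrape the first image in each slider, but I don't know how to scrape the rest of the images in each slider.

I inspected the element and noticed that every time the image in the slider changes, this part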

<div data-position="4" class="PhotoBreadcrumb_active__2T6z2 PhotoBreadcrumb_dot__2PbsQ"></div> 

moves down through these positions (in the example below, image #4 is selected)

<div class="PhotoBreadcrumb_breadcrumbContainer__2cALf" data-testid="breadcrumbContainer">
    <div data-position="0" class="PhotoBreadcrumb_dot__2PbsQ"></div>
    <div data-position="1" class="PhotoBreadcrumb_dot__2PbsQ"></div>
    <div data-position="2" class="PhotoBreadcrumb_dot__2PbsQ"></div>
    <div data-position="3" class="PhotoBreadcrumb_dot__2PbsQ"></div>
    <div data-position="4" class="PhotoBreadcrumb_active__2T6z2 PhotoBreadcrumb_dot__2PbsQ"></div>
    <div data-position="5" class="PhotoBreadcrumb_dot__2PbsQ"></div>
</div>

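You cannot collect all of these images automatically.
Each product displays only one image at a time, and only that image exists on the page.
To change the picture or load another one, you have to click the thumbnail radio buttons below each product. This triggers some JS that loads another image for that product.
In other words, the images that are not displayed do not exist on the page until they are loaded by clicking the radio buttons (the thumbnails below each product).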
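Because each dot stands for one image that only loads on click, the breadcrumb markup tells you how many clicks a product needs. The following is a minimal sketch, not the site's own logic: it parses the snippet from the question with Python's standard `html.parser`. The names `DotCounter` and `summarize_dots` are hypothetical, and matching on the `PhotoBreadcrumb_active` prefix assumes the hashed class suffix may change between builds.

```python
from html.parser import HTMLParser

# Breadcrumb markup copied from the question (image #4 is selected).
BREADCRUMB_HTML = """
<div class="PhotoBreadcrumb_breadcrumbContainer__2cALf" data-testid="breadcrumbContainer">
    <div data-position="0" class="PhotoBreadcrumb_dot__2PbsQ"></div>
    <div data-position="1" class="PhotoBreadcrumb_dot__2PbsQ"></div>
    <div data-position="2" class="PhotoBreadcrumb_dot__2PbsQ"></div>
    <div data-position="3" class="PhotoBreadcrumb_dot__2PbsQ"></div>
    <div data-position="4" class="PhotoBreadcrumb_active__2T6z2 PhotoBreadcrumb_dot__2PbsQ"></div>
    <div data-position="5" class="PhotoBreadcrumb_dot__2PbsQ"></div>
</div>
"""


class DotCounter(HTMLParser):
    """Collects every data-position value and notes which dot is active."""

    def __init__(self):
        super().__init__()
        self.positions = []
        self.active = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "data-position" not in attributes:
            return  # the container div and unrelated tags are skipped
        position = int(attributes["data-position"])
        self.positions.append(position)
        # Assumption: only the class prefix is stable, so match on that.
        if "PhotoBreadcrumb_active" in (attributes.get("class") or ""):
            self.active = position


def summarize_dots(html):
    """Return (list of dot positions, active position or None)."""
    parser = DotCounter()
    parser.feed(html)
    return parser.positions, parser.active


positions, active = summarize_dots(BREADCRUMB_HTML)
print(len(positions), active)  # prints: 6 4
```

For this snippet the slider has six images, so five further clicks would be needed after the initially displayed one.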

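To scrape the src value of every image in the first slider, you need to:

  • Click each slide's dot, inducing WebDriverWait for element_to_be_clickable().

  • Collect the value of each src attribute, inducing WebDriverWait for visibility_of_element_located().

  • You can use the following: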

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumption: Chrome; any WebDriver should work
    wait = WebDriverWait(driver, 20)
    # Container of the first product's slider
    base = "//div[contains(@class, 'Grid_Row__2R-IV') and contains(@class, 'Grid_left')]/div"

    driver.get("https://www.jooraccess.com/r/products?token=feba69103f6c9789270a1412954cf250")
    # Print the image that is displayed initially
    print(wait.until(EC.visibility_of_element_located((By.XPATH, base + "//img"))).get_attribute("src"))
    # Click dots 1 to 4 in turn and print the image shown after each click
    for position in range(1, 5):
        wait.until(EC.element_to_be_clickable((By.XPATH, f"{base}//div[@data-position='{position}']"))).click()
        print(wait.until(EC.visibility_of_element_located((By.XPATH, base + "//img"))).get_attribute("src"))
    
  • Console output:

    https://cdn.jooraccess.com/img/uploads/accounts/678917/images/Sundays_NYC_3202%20(1).jpg
    https://cdn.jooraccess.com/img/uploads/accounts/678917/images/Sundays_NYC_3207.jpg
    https://cdn.jooraccess.com/img/uploads/accounts/678917/images/Maya%20dress_Floral03.jpg
    https://cdn.jooraccess.com/img/uploads/accounts/678917/images/Maya%20dress_Floral04.jpg
    https://cdn.jooraccess.com/img/uploads/accounts/678917/images/Maya%20dress_Floral05.jpg
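The steps above can be generalized so the number of clicks follows the number of dots instead of being hard-coded. This is only a sketch under assumptions: `collect_srcs` and `scrape_first_slider` are hypothetical helper names, and the XPath that counts dots assumes the `breadcrumbContainer` element from the question sits inside the same grid cell as the image.

```python
def collect_srcs(dot_count, click_dot, read_src):
    """Read the src of every slide in one slider, in dot order.

    click_dot(position) activates a slide; read_src() returns the src of
    the image currently shown. Slide 0 is already displayed, so it is
    read without a click.
    """
    srcs = []
    for position in range(dot_count):
        if position > 0:
            click_dot(position)
        srcs.append(read_src())
    return srcs


def scrape_first_slider(driver, timeout=20):
    """Wire collect_srcs to Selenium for the first slider on the page."""
    # Imported here so collect_srcs can be tried without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    wait = WebDriverWait(driver, timeout)
    base = ("//div[contains(@class, 'Grid_Row__2R-IV') "
            "and contains(@class, 'Grid_left')]/div")
    # Assumption: the first breadcrumb container under base belongs to
    # the first product's slider.
    dots_xpath = (f"({base}//div[@data-testid='breadcrumbContainer'])[1]"
                  "/div[@data-position]")

    def click_dot(position):
        locator = (By.XPATH, f"{base}//div[@data-position='{position}']")
        wait.until(EC.element_to_be_clickable(locator)).click()

    def read_src():
        # Caveat: if the image swaps slowly, this may still return the
        # previous slide's src.
        locator = (By.XPATH, base + "//img")
        image = wait.until(EC.visibility_of_element_located(locator))
        return image.get_attribute("src")

    # Make sure the first image is visible before counting the dots.
    wait.until(EC.visibility_of_element_located((By.XPATH, base + "//img")))
    dot_count = len(driver.find_elements(By.XPATH, dots_xpath))
    return collect_srcs(dot_count, click_dot, read_src)
```

With a live `driver` that has already loaded the page, `scrape_first_slider(driver)` would return the list of src values for the first product. Keeping the loop in `collect_srcs` separate from the Selenium calls makes it possible to check the click order with plain functions before running a browser.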