Python Selenium won't select all image tags
I am trying to scrape Product Hunt with Selenium.
More specifically, I am trying to get the source links of all the product icons.
HTML:
My scraping code is as follows:
import time
from selenium import webdriver

driver = webdriver.Chrome("<Your driver's path>")
driver.get("https://www.producthunt.com/topics/seo-tools?order=most-upvoted")
time.sleep(4)  # fixed pause to let the page load
icons = driver.find_elements_by_css_selector("div.styles_thumbnail__d2DAK.styles_thumbnail__XBHZ_ img")
print(len(icons))
print(icons)
driver.close()
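The problem is that Selenium only picks up the first 3 images instead of the icons of all available products.
I have already tried increasing the sleep time and using an explicit wait with EC.presence_of_all_elements_located
to make sure all the icons have loaded properly.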
Since the other icons only appear once you scroll to the bottom of the page, you can do this:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.producthunt.com/topics/seo-tools?order=most-upvoted")
expected_number_of_icons = 20
icons = []
while True:
    # Scroll to the bottom so the page loads the next batch of products
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    icons = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, "//div[contains(@data-test, 'post-item')]//div[@class='styles_thumbnail__d2DAK styles_thumbnail__XBHZ_']//img | //div[contains(@class, 'styles_link')]//span[@class='lazyload-wrapper']/img")))
    # Drop duplicates while keeping the order in which the elements were found
    icons = list(dict.fromkeys(icons))
    # Stop as soon as enough icons have been collected
    if len(icons) >= expected_number_of_icons:
        break
# Discard any extra icons beyond the number requested
icons = icons[:expected_number_of_icons]
driver.close()
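You decide where to stop: once you have reached the number of icons you want. For example, if you end up with 210 icons and only want 200, you can discard the last 10 elements of the list.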
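The loop above keeps scrolling until the target count is reached, so it would not terminate if the page has fewer icons than requested. A minimal sketch of one way to guard against that, assuming the page height stops growing once everything is loaded. The function name `collect_icon_srcs` and the parameters `wanted`, `max_scrolls` and `pause` are made up for this illustration and are not part of Selenium; the string "css selector" is the value of `By.CSS_SELECTOR`.

```python
import time


def collect_icon_srcs(driver, selector, wanted, max_scrolls=30, pause=1.0, by="css selector"):
    """Scroll until `wanted` unique src values are collected or the page stops growing.

    `driver` is expected to be a Selenium WebDriver that has already loaded the page.
    All parameter names here are hypothetical, chosen for this sketch.
    """
    seen = {}  # dict keys keep insertion order, so the icons stay in page order
    last_height = 0
    for _ in range(max_scrolls):
        # Record the src of every image currently matched by the selector
        for img in driver.find_elements(by, selector):
            src = img.get_attribute("src")
            if src:
                seen.setdefault(src, None)
        if len(seen) >= wanted:
            break
        # Scroll to the bottom and give lazy-loaded content time to appear
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(pause)
        height = driver.execute_script("return document.body.scrollHeight")
        if height == last_height:
            break  # nothing new was loaded, so stop instead of looping forever
        last_height = height
    return list(seen)[:wanted]
```

After `driver.get(...)`, this could be called as `collect_icon_srcs(driver, "span.lazyload-wrapper > img", wanted=20)`. Collecting the `src` strings rather than the elements also means the result stays usable after the browser is closed.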
To print the values of the src attribute, you can use either of the following approaches:
Using a CSS selector:
print([my_elem.get_attribute("src") for my_elem in driver.find_elements(By.CSS_SELECTOR, "span.lazyload-wrapper > img")])
Using XPath:
print([my_elem.get_attribute("src") for my_elem in driver.find_elements(By.XPATH, "//span[@class='lazyload-wrapper']/img")])
Ideally, you should induce WebDriverWait for visibility_of_all_elements_located(), using either of the following approaches:
Using CSS_SELECTOR:
driver.get('https://www.producthunt.com/topics/seo-tools?order=most-upvoted')
print([my_elem.get_attribute("src") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "span.lazyload-wrapper > img")))])
Using XPATH in a single line:
driver.get('https://www.producthunt.com/topics/seo-tools?order=most-upvoted')
print([my_elem.get_attribute("src") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//span[@class='lazyload-wrapper']/img")))])
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
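The built-in conditions used above succeed as soon as the currently matched elements are present or visible, which does not by itself guarantee any particular number of images. Since `WebDriverWait.until` accepts any callable that takes the driver and returns a truthy value, one option is a small custom condition that waits for a minimum count. This is only a sketch; `at_least_n_elements` is a hypothetical helper written for this example, not something Selenium provides.

```python
def at_least_n_elements(locator, n):
    """Build a wait condition that passes once `n` or more elements match `locator`.

    `locator` is a (by, value) tuple such as (By.CSS_SELECTOR, "span.lazyload-wrapper > img").
    Hypothetical helper for illustration; it is not part of Selenium itself.
    """
    def _predicate(driver):
        found = driver.find_elements(*locator)
        # Returning False tells WebDriverWait to keep polling until the timeout
        return found if len(found) >= n else False

    return _predicate
```

With the imports listed above, it could be used as `icons = WebDriverWait(driver, 20).until(at_least_n_elements((By.CSS_SELECTOR, "span.lazyload-wrapper > img"), 20))`. On a page that loads more items only on scroll, the wait would still time out unless something scrolls the page in the meantime.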