Parsing a dynamic webpage using Selenium
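I'm trying to scrape images from Amazon, and it's not easy.
I think I'm almost there, but I'm not getting a result.
Here I'm using Selenium to 1. open the main image, 2. click the second image among the thumbnails,
3. then get the src of the full-size version of that second image.
But it fails, and I don't know why.
Here is the code I wrote: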
from urllib.request import urlretrieve
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
import time
driver = webdriver.Firefox()
url = "https://www.amazon.com/Kraft-Original-Macaroni-Microwaveable-Packets/dp/B005ECO3H0"
driver.get(url)
action = ActionChains(driver)
time.sleep(5)
driver.find_element_by_css_selector('#landingImage').click()
time.sleep(10)
html = driver.page_source
soup = BeautifulSoup(html,"html.parser")
driver.find_element_by_css_selector('#ivImage_1').click()
amazon = soup.select_one(".fullscreen")
imgUrl = amazon.find("img")['src']
print(imgUrl)
One thing I can't understand: if I type print(amazon), it gives me the img tag, but when the code above runs, imgUrl fails with a 'NoneType' error.
Please help me find the answer.
Here you go:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Firefox()
url = "https://www.amazon.com/Kraft-Original-Macaroni-Microwaveable-Packets/dp/B005ECO3H0"
driver.get(url)
time.sleep(5)  # crude wait for the product page to load
# Selenium 4 style locators; the find_element_by_* helpers were removed in newer releases
# open the image viewer by clicking the main image
driver.find_element(By.CSS_SELECTOR, '#landingImage').click()
time.sleep(5)  # wait for the viewer to open
# select the second thumbnail
driver.find_element(By.CSS_SELECTOR, '#ivImage_1').click()
# read src from the live DOM instead of an earlier page_source snapshot
image_url = driver.find_element(By.CLASS_NAME, "fullscreen").get_attribute("src")
print(image_url)
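As for why the code in the question reports 'NoneType': since print(amazon) already shows an img tag, the element matched by .fullscreen is presumably the img itself. BeautifulSoup's find() only searches inside a tag, so amazon.find("img") returns None, and indexing None with ['src'] fails. A minimal sketch reproducing this on a made-up HTML fragment (the markup and URL are placeholders, not Amazon's actual page):

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the image viewer; not Amazon's real HTML.
html = '<div id="viewer"><img class="fullscreen" src="https://example.com/full.jpg"></div>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.select_one(".fullscreen")  # this is already the <img> tag
inner = tag.find("img")               # looks for an <img> inside the <img>

print(tag.name)    # img
print(inner)       # None
print(tag["src"])  # https://example.com/full.jpg
```

Note also that the question builds soup from page_source before clicking #ivImage_1, so that snapshot cannot reflect the second image. Asking the driver for the element after the click avoids both problems.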
# if you want to download
import requests
resp = requests.get(image_url)
with open("asd.png", "wb") as image:
    image.write(resp.content)
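The download step always writes to asd.png, whatever format the image actually has. One possible refinement is to take the filename from the URL itself. This is only a sketch: filename_from_url is a hypothetical helper, and the URL below is a placeholder.

```python
import posixpath
from urllib.parse import urlparse


def filename_from_url(image_url, default="image.jpg"):
    """Return the last path segment of image_url, or default if there is none."""
    # urlparse separates the query string, so "?size=large" never ends up in the name
    name = posixpath.basename(urlparse(image_url).path)
    return name or default


# Placeholder URL for illustration only
print(filename_from_url("https://example.com/images/I/81abc.jpg?size=large"))  # 81abc.jpg
```

The returned name could then be passed to open(..., "wb") in place of the hard-coded one.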