Not getting all links from webpage
I am working on a web scraping project. The URL of the site I am scraping is https://www.beliani.de/sofas/ledersofa/
I am scraping the links of all the products listed on this page. I have tried getting the links with both Requests-HTML and Selenium, but I get 57 and 24 links respectively, even though there are more than 150 products listed on the page.
Below are the code blocks I am using.
Using Selenium:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from time import sleep

options = Options()
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36")

# path to chrome driver
DRIVER_PATH = 'C:/chromedriver'
driver = webdriver.Chrome(executable_path=DRIVER_PATH, options=options)

url = 'https://www.beliani.de/sofas/ledersofa/'
driver.get(url)
sleep(20)

links = []
for a in driver.find_elements_by_xpath('//*[@id="offers_div"]/div/div/a'):
    print(a)
    links.append(a)
print(len(links))
Using Requests-HTML:
from requests_html import HTMLSession

url = 'https://www.beliani.de/sofas/ledersofa/'
s = HTMLSession()
r = s.get(url)
r.html.render(sleep=20)
products = r.html.xpath('//*[@id="offers_div"]', first=True)

# getting 57 links using the block below:
links = []
for link in products.absolute_links:
    print(link)
    links.append(link)
print(len(links))
I don't know which step I am getting wrong or what I am missing.
You have to scroll through the website and reach the end of the page so that all the scripts on the webpage get loaded. Just by opening the website, only the scripts needed to view that particular section of the page are loaded. So when you run your code, it can only retrieve data from the scripts that have already been loaded.
This one gives me 160 links:
driver.get('https://www.beliani.de/sofas/ledersofa/')
sleep(3)

# get the whole height of the document
height = driver.execute_script('return document.body.scrollHeight')

# break the page into parts so that each section gets scrolled through and loaded
scroll_height = 0
for i in range(10):
    scroll_height = scroll_height + (height / 10)
    driver.execute_script('window.scrollTo(0, arguments[0]);', scroll_height)
    sleep(2)

# after the scroll loop, locate the links; I used the class name locator,
# but you can use any locator you want
a_tags = driver.find_elements_by_class_name('itemBox')
count = 0
for i in a_tags:
    if i.get_attribute('href') is not None:
        print(i.get_attribute('href'))
        count += 1

print(count)
driver.quit()
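The fixed ten-step scroll above assumes the final page height is known when the loop starts. A slightly more defensive variant (just a sketch, reusing the driver and sleep from the snippet above) keeps scrolling to the bottom until document.body.scrollHeight stops growing, which also covers pages that keep lazy-loading new content as you scroll:

# keep scrolling to the bottom until the document height stops growing,
# i.e. no more products are being lazy-loaded
last_height = driver.execute_script('return document.body.scrollHeight')
while True:
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    sleep(2)  # give lazy-loaded items time to render
    new_height = driver.execute_script('return document.body.scrollHeight')
    if new_height == last_height:
        break
    last_height = new_height

The same idea should carry over to Requests-HTML: render() accepts scrolldown and sleep arguments (for example r.html.render(scrolldown=10, sleep=2)), which scroll the headless browser down in steps before the HTML is captured.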
Using Selenium and Python, you need to accept the cookies, and you have to induce WebDriverWait for visibility_of_all_elements_located(). Then you can use either of the following locator strategies to extract the total number of links:
Using CSS_SELECTOR:
driver.get("https://www.beliani.de/sofas/ledersofa/")
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[value='Akzeptieren']"))).click()
print(len(WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div#offers_div > div > div > a[href]")))))
Using XPATH:
driver.get("https://www.beliani.de/sofas/ledersofa/")
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//input[@value='Akzeptieren']"))).click()
print(len(WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@id='offers_div']/div/div/a[@href]")))))
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
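Putting both answers together, here is a minimal end-to-end sketch (the cookie button value and the offers_div selector are the ones shown above; treat it as an illustration under those assumptions, not a guaranteed recipe). It accepts the cookies, scrolls in steps to trigger the lazy loading, and then collects the hrefs:

from time import sleep

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
driver.get('https://www.beliani.de/sofas/ledersofa/')

# accept the cookie banner so it does not block the rest of the page
WebDriverWait(driver, 20).until(EC.element_to_be_clickable(
    (By.CSS_SELECTOR, "input[value='Akzeptieren']"))).click()

# scroll in steps so every product tile gets lazy-loaded
height = driver.execute_script('return document.body.scrollHeight')
for i in range(1, 11):
    driver.execute_script('window.scrollTo(0, arguments[0]);', height * i / 10)
    sleep(2)

# collect the product hrefs
links = [a.get_attribute('href') for a in driver.find_elements(
    By.CSS_SELECTOR, 'div#offers_div > div > div > a[href]')]
print(len(links))
driver.quit()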