The problem of crawling every next page using BeautifulSoup and webdriver
I am trying to use BeautifulSoup and Selenium to scrape all the job links from https://www.vietnamworks.com/tim-viec-lam/tat-ca-viec-lam.
The problem is that I can only scrape the links on the first page and do not know how to get the links on the next pages.
Here is the code I have tried:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support import expected_conditions as EC
import time
import requests
from bs4 import BeautifulSoup
import array as arr
import pandas as pd
#The first line imports the WebDriver, and the second imports Chrome Options
#-----------------------------------#
#Chrome Options
all_link = []
chrome_options = Options()
chrome_options.add_argument('--ignore-certificate-errors')
chrome_options.add_argument('--incognito')
chrome_options.add_argument('--window-size=1920x1080')
chrome_options.add_argument('--headless')
#-----------------------------------#
driver = webdriver.Chrome(chrome_options=chrome_options, executable_path="C:/webdriver/chromedriver.exe")
#Open url
url = "https://www.vietnamworks.com/tim-viec-lam/tat-ca-viec-lam"
driver.get(url)
time.sleep(2)
#-----------------------------------#
page_source = driver.page_source
page = page_source
soup = BeautifulSoup(page_source,"html.parser")
block_job_list = soup.find_all("div",{"class":"d-flex justify-content-center align-items-center logo-area-wrapper logo-border"})
for i in block_job_list:
    link = i.find("a")
    all_link.append("https://www.vietnamworks.com/" + link.get("href"))
Since your problem is iterating through the pages, this code will help you do that. As mentioned, insert your scraping code inside the while loop.
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import time
from webdriver_manager.chrome import ChromeDriverManager # use pip install webdriver_manager if not installed
option = webdriver.ChromeOptions()
CDM = ChromeDriverManager()
driver = webdriver.Chrome(CDM.install(),options=option)
url = 'https://www.vietnamworks.com/tim-viec-lam/tat-ca-viec-lam'
driver.get(url)
time.sleep(3)
page_num = 1
links = []
driver.execute_script("window.scrollTo(0, document.body.scrollHeight/2);")
while True:
    # create the soup element here so that it can get the page source of every page
    # sample scraping of the URLs of the jobs posted
    for i in driver.find_elements_by_class_name('job-title '):
        links.append(i.get_attribute('href'))
    # moves to the next page
    try:
        print(f'On page {str(page_num)}')
        print()
        page_num += 1
        driver.find_element_by_link_text(str(page_num)).click()
        time.sleep(3)
    # triggers only at the end of the pages
    except NoSuchElementException:
        print('End of pages')
        break
driver.quit()
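Since your own attempt already imports pandas, here is a minimal sketch of one way to de-duplicate and save the collected links once the loop finishes (the output filename is just an example):
import pandas as pd

unique_links = list(dict.fromkeys(links))  # drop duplicates while preserving order
pd.DataFrame({"job_link": unique_links}).to_csv("vietnamworks_links.csv", index=False)
print(f"Saved {len(unique_links)} links")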
Edit:
- Simplified and modified the pagination approach.
- If you use BeautifulSoup, you must put the page_source and soup variables inside the while loop, because the page source changes after every iteration. In your code you only extracted the source of the first page, which is why you got the same output repeated as many times as there are pages (see the sketch after this list).
- By using ChromeDriverManager from the webdriver-manager package, there is no need to mention the location/executable path. You can copy-paste this code and run it on any machine that has Chrome installed. You just have to install the package with pip install webdriver_manager in cmd before running the code.
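To illustrate the second point, here is a minimal sketch of how your BeautifulSoup extraction could sit inside the pagination loop (it assumes the div class from your original code still matches the page layout):
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import time

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get("https://www.vietnamworks.com/tim-viec-lam/tat-ca-viec-lam")
time.sleep(3)

all_link = []
page_num = 1
while True:
    # re-parse the page source on every iteration so each page's links are captured
    soup = BeautifulSoup(driver.page_source, "html.parser")
    for block in soup.find_all("div", {"class": "d-flex justify-content-center align-items-center logo-area-wrapper logo-border"}):
        link = block.find("a")
        if link and link.get("href"):
            all_link.append("https://www.vietnamworks.com/" + link.get("href"))
    try:
        # click the link of the next page number; stops when it no longer exists
        page_num += 1
        driver.find_element_by_link_text(str(page_num)).click()
        time.sleep(3)
    except NoSuchElementException:
        break
driver.quit()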
Warning: avoid showing the actual username and password of any of your accounts, such as the GitHub credentials that appear in your code.
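If a script does need credentials, one common approach is to read them from environment variables rather than hard-coding them; a minimal sketch (the variable names here are just placeholders):
import os

# hypothetical environment variable names; set them outside the script
github_user = os.environ.get("GITHUB_USERNAME")
github_pass = os.environ.get("GITHUB_PASSWORD")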