How to scrape the next pages in Python using BeautifulSoup
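Suppose I am scraping the URL
http://www.engineering.careers360.com/colleges/list-of-engineering-colleges-in-India?sort_filter=alpha
The listing is spread over a number of pages that contain the data I want to scrape. How can I scrape the data from all of the following pages?
I am using Python 3.5.1 and BeautifulSoup.
Note: I can't use Scrapy or lxml, because they give me installation errors.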
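Determine the last page by extracting the page argument from the href of the "Go to the last page" element. Then loop over every page, keeping a single web-scraping session open via requests.Session():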
    import re
    import requests
    from bs4 import BeautifulSoup

    url = "http://www.engineering.careers360.com/colleges/list-of-engineering-colleges-in-India?sort_filter=alpha"

    with requests.Session() as session:
        # extract the last page number from the "Go to the last page" link
        response = session.get(url)
        soup = BeautifulSoup(response.content, "html.parser")
        last_page = int(re.search(r"page=(\d+)", soup.select_one("li.pager-last").a["href"]).group(1))

        # loop over every page; the pager is assumed to be zero-based,
        # so last_page itself is included
        for page in range(last_page + 1):
            response = session.get(url + "&page=%d" % page)
            soup = BeautifulSoup(response.content, "html.parser")

            # print the title of every search result
            for result in soup.select("li.search-result"):
                title = result.find("div", class_="title").get_text(strip=True)
                print(title)
Prints:
A C S College of Engineering, Bangalore
A1 Global Institute of Engineering and Technology, Prakasam
AAA College of Engineering and Technology, Thiruthangal
...
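The page-number handling in the answer can also be done with the standard library's urllib.parse instead of a regular expression and string formatting. The following is only a sketch under stated assumptions: the helper names extract_last_page and build_page_urls are hypothetical (they are not part of requests or BeautifulSoup), and the pager is assumed to be zero-based, with the "Go to the last page" link carrying the index of the final page.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

BASE_URL = (
    "http://www.engineering.careers360.com/colleges/"
    "list-of-engineering-colleges-in-India?sort_filter=alpha"
)


def extract_last_page(href):
    """Return the integer value of the `page` query parameter in href.

    Hypothetical helper: href is expected to be the link target of the
    "Go to the last page" element.
    """
    query = parse_qs(urlsplit(href).query)
    return int(query["page"][0])


def build_page_urls(base_url, last_page):
    """Return one URL per page, from page 0 through last_page inclusive.

    Assumes a zero-based pager; existing query parameters are preserved.
    """
    parts = urlsplit(base_url)
    params = parse_qs(parts.query)
    urls = []
    for page in range(last_page + 1):
        params["page"] = [str(page)]
        query = urlencode(params, doseq=True)
        urls.append(urlunsplit(parts._replace(query=query)))
    return urls
```

Each URL returned by build_page_urls could then be passed to session.get() inside the loop, with the parsing of each response left exactly as in the answer.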