How to scrape the next pages in Python using BeautifulSoup

Question

Suppose I am scraping this URL:

http://www.engineering.careers360.com/colleges/list-of-engineering-colleges-in-India?sort_filter=alpha

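The listing spans a number of pages, all containing data I want to scrape. How can I scrape the data from every following page? I am using Python 3.5.1 and BeautifulSoup. Note: I can't use Scrapy or lxml, because both give me installation errors.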

Answer 1

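Determine the last page by extracting the page parameter from the "Go to the last page" element, then loop over every page while maintaining a single web-scraping session via requests.Session():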

import re

import requests
from bs4 import BeautifulSoup


with requests.Session() as session:
    # extract the last page number from the "Go to the last page" link
    response = session.get("http://www.engineering.careers360.com/colleges/list-of-engineering-colleges-in-India?sort_filter=alpha")
    soup = BeautifulSoup(response.content, "html.parser")
    # raw string so that \d is not treated as a string escape
    last_page = int(re.search(r"page=(\d+)", soup.select_one("li.pager-last").a["href"]).group(1))

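    # loop over every page; assumes the page parameter is zero-based,
    # so the last page's own index has to be included in the range
    for page in range(last_page + 1):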
        response = session.get("http://www.engineering.careers360.com/colleges/list-of-engineering-colleges-in-India?sort_filter=alpha&page=%d" % page)
        soup = BeautifulSoup(response.content, "html.parser")

        # print the title of every search result
        for result in soup.select("li.search-result"):
            title = result.find("div", class_="title").get_text(strip=True)
            print(title)

Prints:

A C S College of Engineering, Bangalore
A1 Global Institute of Engineering and Technology, Prakasam
AAA College of Engineering and Technology, Thiruthangal
...
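
The page-number extraction can be tried on its own, without any network access. Below is a minimal sketch that reads the page query parameter with the standard library's urllib.parse rather than the regular expression used in the answer. The function name last_page_from_href and the sample href are hypothetical, chosen only for illustration; the real pager link on the site may look different.

```python
from urllib.parse import urlsplit, parse_qs


def last_page_from_href(href):
    """Return the integer value of the `page` query parameter in href.

    Hypothetical helper; raises KeyError if href has no `page` parameter.
    """
    query = parse_qs(urlsplit(href).query)
    # parse_qs maps each parameter name to a list of its values
    return int(query["page"][0])


# Hypothetical href, shaped like the pager link the answer reads
sample_href = "/colleges/list-of-engineering-colleges-in-India?sort_filter=alpha&page=12"
print(last_page_from_href(sample_href))
```

Parsing the query string this way avoids matching a parameter whose name merely ends in "page", which a bare page=(\d+) pattern could do.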
