Loop pages and crawl the contents in Python
I want to scrape the contents from this link:
How do I loop through all the pages and crawl all the elements in the red circle? Thanks.
Code:
from bs4 import BeautifulSoup
import requests

# Fetch the listing page and parse the static HTML it returns
url = 'http://www.eoechina.com.cn/cn2019/gonggaoxinxi.html?classID=1'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
print(soup)
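A quick check makes the problem visible: the static HTML that requests receives should not contain the announcement titles, because the page loads them separately via an XHR call (the endpoint used in the answer below). A minimal sketch of that sanity check, assuming only that the titles would appear as link text if they were server-rendered:

from bs4 import BeautifulSoup
import requests

url = 'http://www.eoechina.com.cn/cn2019/gonggaoxinxi.html?classID=1'
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# Collect every non-empty link text in the static markup; if the list were
# server-rendered, the announcement titles would show up in this output.
link_texts = [a.get_text(strip=True) for a in soup.find_all("a")]
print([t for t in link_texts if t])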
There's an endpoint you can query to page through the results.
Here's how:
from urllib.parse import urlencode

import requests
import pandas as pd

# XHR endpoint the listing page calls to fetch each page of announcements
end_point = "http://www.eoechina.com.cn/cn2016/mobile/GetArticleList.ashx"

payload = {
    "pageNumber": 1,
    "classID": 1,
    "searchKey": "",
    "selectItemID": "72,",
}

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:86.0) "
                  "Gecko/20100101 Firefox/86.0",
    "X-Requested-With": "XMLHttpRequest",
}

# Fetch the first four pages; widen the range to cover more
for page in range(1, 5):
    payload["pageNumber"] = page
    response = requests.post(
        end_point,
        data=urlencode(payload),
        headers=headers,
    ).json()
    # print("\n".join(item["title"] for item in response))
    df = pd.DataFrame(response)
    print(df)
Sample output: (it's a screenshot, because SO flags the raw output as spam...)
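To cover every page rather than a fixed range, you can keep requesting until the endpoint stops returning items. A minimal sketch of that idea, assuming the endpoint returns an empty JSON list once pageNumber runs past the last page (worth verifying against the live site), collecting everything into one DataFrame:

from urllib.parse import urlencode

import requests
import pandas as pd

end_point = "http://www.eoechina.com.cn/cn2016/mobile/GetArticleList.ashx"
payload = {"pageNumber": 1, "classID": 1, "searchKey": "", "selectItemID": "72,"}
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:86.0) "
                  "Gecko/20100101 Firefox/86.0",
    "X-Requested-With": "XMLHttpRequest",
}

frames = []
for page in range(1, 200):  # hard upper bound as a safety net
    payload["pageNumber"] = page
    items = requests.post(end_point, data=urlencode(payload), headers=headers).json()
    if not items:  # assumption: an empty list signals we are past the last page
        break
    frames.append(pd.DataFrame(items))

all_articles = pd.concat(frames, ignore_index=True)
print(all_articles)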