Loop pages and crawl the contents in Python
I want to scrape the contents from this link:
How do I loop through all the pages and crawl all the elements in the red circle? Thanks.
Code:
from bs4 import BeautifulSoup
import requests

# Fetch the listing page and parse the static HTML it returns
url = 'http://www.eoechina.com.cn/cn2019/gonggaoxinxi.html?classID=1'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
print(soup)
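A quick check makes the problem visible: the static HTML that requests receives should not contain the announcement titles, because the page loads them separately via an XHR call (the endpoint used in the answer below). A minimal sketch of that sanity check, assuming only that the titles would appear as link text if they were server-rendered:

from bs4 import BeautifulSoup
import requests

url = 'http://www.eoechina.com.cn/cn2019/gonggaoxinxi.html?classID=1'
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# Collect every non-empty link text in the static markup; if the list were
# server-rendered, the announcement titles would show up in this output.
link_texts = [a.get_text(strip=True) for a in soup.find_all("a")]
print([t for t in link_texts if t])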
There's an endpoint you can query to page through the results.
Here's how:
from urllib.parse import urlencode

import requests
import pandas as pd

# XHR endpoint the listing page calls to fetch each page of announcements
end_point = "http://www.eoechina.com.cn/cn2016/mobile/GetArticleList.ashx"

payload = {
    "pageNumber": 1,
    "classID": 1,
    "searchKey": "",
    "selectItemID": "72,",
}

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:86.0) "
                  "Gecko/20100101 Firefox/86.0",
    "X-Requested-With": "XMLHttpRequest",
}

# Fetch the first four pages; widen the range to cover more
for page in range(1, 5):
    payload["pageNumber"] = page
    response = requests.post(
        end_point,
        data=urlencode(payload),
        headers=headers,
    ).json()
    # print("\n".join(item["title"] for item in response))
    df = pd.DataFrame(response)
    print(df)
Sample output: (it's a screenshot, because SO flags the raw output as spam...)
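To cover every page rather than a fixed range, you can keep requesting until the endpoint stops returning items. A minimal sketch of that idea, assuming the endpoint returns an empty JSON list once pageNumber runs past the last page (worth verifying against the live site), collecting everything into one DataFrame:

from urllib.parse import urlencode

import requests
import pandas as pd

end_point = "http://www.eoechina.com.cn/cn2016/mobile/GetArticleList.ashx"
payload = {"pageNumber": 1, "classID": 1, "searchKey": "", "selectItemID": "72,"}
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:86.0) "
                  "Gecko/20100101 Firefox/86.0",
    "X-Requested-With": "XMLHttpRequest",
}

frames = []
for page in range(1, 200):  # hard upper bound as a safety net
    payload["pageNumber"] = page
    items = requests.post(end_point, data=urlencode(payload), headers=headers).json()
    if not items:  # assumption: an empty list signals we are past the last page
        break
    frames.append(pd.DataFrame(items))

all_articles = pd.concat(frames, ignore_index=True)
print(all_articles)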