Scrape dynamically loading page using BeautifulSoup
I'm new to the BeautifulSoup package. I'm trying to scrape all the food recipes, their links, and their ingredients from https://indianrecipes.com/new_and_popular
The problem is that this site only loads more recipes when you scroll down. I referred to a similar question, but it didn't help much.
I looked at the Network tab in the browser's inspector and found that every time I scroll down, an XHR request is sent:
api?tm=1565542062069
api?tm=1565542065302
api?tm=1565542073116
api?tm=1565542075617
Is it possible to simulate such requests in Python to extract all the food recipes from this page?
You have to use Selenium to render the page's JavaScript into HTML, then scroll with Selenium:
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome('/home/sush/Downloads/Compressed/chromedriver_linux64/chromedriver')
driver.get('https://indianrecipes.com/new_and_popular')

heights = []
for i in range(1, 300):
    body = driver.find_element_by_css_selector('body')
    time.sleep(0.1)
    body.send_keys(Keys.END)  # jump to the bottom to trigger lazy loading
    heights.append(driver.execute_script("return document.body.scrollHeight"))
    # every 16 scrolls, stop if the page height has stopped growing
    if i % 16 == 0 and heights[i - 16] == heights[i - 1]:
        break
Then scrape the data you need with BeautifulSoup:
soup = BeautifulSoup(driver.page_source, 'lxml')
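For example, once the fully loaded page source is in the soup, the recipe links could be pulled out with a CSS selector. A minimal sketch, with hypothetical markup standing in for driver.page_source (the real class names must be checked in the browser's inspector):

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page's class names must be
# checked in DevTools before writing the selector
html = '<div><a class="recipe" href="/recipe/Dahi-Vada_Ad3A">Dahi Vada</a></div>'
soup = BeautifulSoup(html, 'html.parser')

# Collect (name, link) pairs for every matching anchor
links = [(a.text, a['href']) for a in soup.select('a.recipe')]
print(links)  # → [('Dahi Vada', '/recipe/Dahi-Vada_Ad3A')]
```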
The number in api?tm=1565542075617 is an epoch timestamp in milliseconds. It is probably not required for the request. What matters is the data sent with the request, since that is what the server responds to. In the XHR request, scroll down to Request Payload to see it.
Below is Python code that loads recipes_per_page recipes after the initial offset recipes:
import requests

offset = 50
recipes_per_page = 50
# JSON-RPC request body, as seen in the Request Payload
data = [{'jsonrpc': '2.0', 'method': 'recipe.get_trending', 'id': 1, 'params': [offset, recipes_per_page, None, False]}]
response = requests.post('https://indianrecipes.com/api', json=data)
recipes = response.json()[0]['result']['recipes']
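The recipes value is a plain list of dicts, so it can be walked directly. A sketch using a sample response in the shape the endpoint returns (the fields match the JSON shown further down), trimmed to one entry so it runs offline:

```python
# Sample response in the shape the API returns, trimmed to one recipe
payload = [{'id': 1, 'jsonrpc': '2.0', 'result': {'recipes': [
    {'name': 'Dahi Vada', 'link': '//indianrecipes.com/recipe/Dahi-Vada_Ad3A', 'rating': 5.0},
]}}]

recipes = payload[0]['result']['recipes']
for r in recipes:
    # links come back protocol-relative, so prefix a scheme
    print(r['name'], 'https:' + r['link'], r['rating'])
```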
I made a simple script where you can specify the number of recipes per page and the number of pages to scrape. It returns the data in JSON format:
from itertools import count, islice
import requests
import json

url = 'https://indianrecipes.com/api'
data = {"id": 1, "jsonrpc": "2.0", "method": "recipe.get_trending", "params": [50, 50, None, False]}

per_page = 50
num_pages = 2

# count(0, per_page) yields the offsets 0, 50, 100, ...; islice limits them to num_pages
for i, c in enumerate(islice(count(0, per_page), 0, num_pages), 1):
    print('Page no.{} :'.format(i))
    print('-' * 80)
    data['params'][0] = c          # offset
    data['params'][1] = per_page   # recipes per request
    json_data = requests.post(url, json=data).json()
    print(json.dumps(json_data, indent=4))
    print('-' * 80)
Prints:
Page no.1 :
--------------------------------------------------------------------------------
{
"id": 1,
"jsonrpc": "2.0",
"result": {
"recipes": [
{
"has_video": false,
"id": 8630002,
"image_url": "//lh3.googleusercontent.com/zgZHuLeSg_lKRc66RycpaDoSVMULp3puzoignsoEH40DJBQtOpQi0Ub1L1ET52VFhd3ZUF8r8ZEiD_kEsZNQPloO3_T1KW9sbBE",
"link": "//indianrecipes.com/recipe/Dahi-Vada_Ad3A",
"name": "Dahi Vada",
"rating": 5.0,
"score": 0.0
},
{
"has_video": false,
"id": 9330018,
"image_url": "//lh3.googleusercontent.com/HXd-CD3P0U_v4ItJplGsT5oKZ8mKAAA0AXRsgeOoeLeH4ggvyGRdx-6Y_J1H1EdRLv5De7b5oYqeHkBts4VwIpqBAHNA_OYP8g",
"link": "//indianrecipes.com/recipe/French-Egg-Casserole_D9aa",
"name": "French Egg Casserole",
"rating": 0.0,
"score": 0.0
},
...and so on
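To fetch everything rather than a fixed number of pages, the same request can be repeated until the API returns an empty batch. A sketch under that assumption about how the endpoint behaves (the network loop is commented out so the snippet runs offline):

```python
def trending_params(offset, per_page):
    # Build the JSON-RPC body seen in the Request Payload
    return [{'jsonrpc': '2.0', 'method': 'recipe.get_trending', 'id': 1,
             'params': [offset, per_page, None, False]}]

# import requests
# all_recipes = []
# offset = 0
# while True:
#     resp = requests.post('https://indianrecipes.com/api',
#                          json=trending_params(offset, 50))
#     batch = resp.json()[0]['result']['recipes']
#     if not batch:        # assumed: an empty list signals the end
#         break
#     all_recipes.extend(batch)
#     offset += 50

print(trending_params(100, 50)[0]['params'])  # → [100, 50, None, False]
```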