How to get the complete webpage after scrolling

Question

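I want to scrape a web page. It has a list of recipes, but to reach them you have to scroll down so that they load. I'm using the requests_html library to fetch the page. I read through this post and saw someone say that you can change the request so that it serves "sub-pages" of the page, but that didn't seem to make any difference. I also tried adding scrolldown=2000, sleep=2 to the resp.html.render() call, but that didn't seem to make any difference either. Here is some sample code: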

from requests_html import HTMLSession
from bs4 import BeautifulSoup

def get(url):
    session = HTMLSession()
    resp = session.get(url)
    resp.html.render()
    data = resp.html.html
    resp.close()
    session.close()
    html = BeautifulSoup(data, "html.parser")
    # This prints how many recipes were found. It always gives me 20,
    # which is how many are visible before scrolling.
    print(len(html.find_all("div", class_="search-results-matrix__item")))

get("https://app.ckbk.com/search?q=recipes&sort=popularity&book_full_title%5B0%5D=Splendid%20Soups&p=1")
# Here I tried changing the last part of the request to p=5 rather than 1,
# as that is what changed when I scrolled.
get("https://app.ckbk.com/search?q=recipes&sort=popularity&book_full_title%5B0%5D=Splendid%20Soups&p=5")

Any help is appreciated.

Answer 1

A better approach is to call the site's JSON API directly, which returns all the information you need. I suggest you print(j) to see the structure of the returned data:

import requests
import json

search = "Splendid Soups"
# The query below asks for up to 500 hits ("size": 500)
post_data = """{"query":{"bool":{"must":[{"bool":{"should":[{"simple_query_string":{"query":"recipes","fields":["authors^15","book_authors^12","title^10","title.exact^10","name^10","subtitle^5","subtitle.exact^5","book_title^2","book_full_title^2","book_subtitle","country^2","ingredients","ingredients.exact","additional_keywords","diet_type^2","course","skill","headings","book_licensor"],"default_operator":"and"}},{"match_phrase_prefix":{"full_title":{"query":"recipes","boost":300}}}]}},{"term":{"_type":"Recipe"}}]}},"post_filter":{"bool":{"must":[{"term":{"is_licensed":true}},{"term":{"is_indexable":true}},{"term":{"book_full_title.raw":"xxxxx"}}]}},"aggs":{"label3":{"filter":{"bool":{"must":[{"term":{"is_licensed":true}},{"term":{"is_indexable":true}},{"term":{"book_full_title.raw":"xxxxx"}}]}},"aggs":{"label.raw":{"terms":{"field":"label.raw","size":50}},"label.raw_count":{"cardinality":{"field":"label.raw"}}}},"country6":{"filter":{"bool":{"must":[{"term":{"is_licensed":true}},{"term":{"is_indexable":true}},{"term":{"book_full_title.raw":"xxxxx"}}]}},"aggs":{"country.raw":{"terms":{"field":"country.raw","size":50}},"country.raw_count":{"cardinality":{"field":"country.raw"}}}},"course7":{"filter":{"bool":{"must":[{"term":{"is_licensed":true}},{"term":{"is_indexable":true}},{"term":{"book_full_title.raw":"xxxxx"}}]}},"aggs":{"course.raw":{"terms":{"field":"course.raw","size":50}},"course.raw_count":{"cardinality":{"field":"course.raw"}}}},"diet_type8":{"filter":{"bool":{"must":[{"term":{"is_licensed":true}},{"term":{"is_indexable":true}},{"term":{"book_full_title.raw":"xxxxx"}}]}},"aggs":{"diet_type.raw":{"terms":{"field":"diet_type.raw","size":50}},"diet_type.raw_count":{"cardinality":{"field":"diet_type.raw"}}}},"skill9":{"filter":{"bool":{"must":[{"term":{"is_licensed":true}},{"term":{"is_indexable":true}},{"term":{"book_full_title.raw":"xxxxx"}}]}},"aggs":{"skill.raw":{"terms":{"field":"skill.raw","size":50}},"skill.raw_count":{"cardinality":{"field":"skill.raw"}}}},"era10":{"filter":{"bool":{"must":[{"term":{"is_licensed":true}},{"term":{"is_indexable":true}},{"term":{"book_full_title.raw":"xxxxx"}}]}},"aggs":{"era.raw":{"terms":{"field":"era.raw","size":50}},"era.raw_count":{"cardinality":{"field":"era.raw"}}}},"authors11":{"filter":{"bool":{"must":[{"term":{"is_licensed":true}},{"term":{"is_indexable":true}},{"term":{"book_full_title.raw":"xxxxx"}}]}},"aggs":{"authors.raw":{"terms":{"field":"authors.raw","size":50}},"authors.raw_count":{"cardinality":{"field":"authors.raw"}}}},"book_full_title12":{"filter":{"bool":{"must":[{"term":{"is_licensed":true}},{"term":{"is_indexable":true}}]}},"aggs":{"book_full_title.raw":{"terms":{"field":"book_full_title.raw","size":50}},"book_full_title.raw_count":{"cardinality":{"field":"book_full_title.raw"}}}},"book_title13":{"filter":{"bool":{"must":[{"term":{"is_licensed":true}},{"term":{"is_indexable":true}},{"term":{"book_full_title.raw":"xxxxx"}}]}},"aggs":{"book_title.raw":{"terms":{"field":"book_title.raw","size":50}},"book_title.raw_count":{"cardinality":{"field":"book_title.raw"}}}},"no_filters_top_hits":{"top_hits":{"size":1,"_source":false}}},"size":500,"from":0,"sort":[{"rank":{"order":"desc"}},{"_score":{"order":"desc"}}],"suggest":{"text":"recipes","suggestions":{"phrase":{"field":"title","real_word_error_likelihood":0.95,"max_errors":1,"gram_size":4,"direct_generator":[{"field":"title","suggest_mode":"always","min_word_length":1}]}}}}"""
post_data = post_data.replace("xxxxx", search)
url = "https://app.ckbk.com/es-search/_search"
r = requests.post(url, json=json.loads(post_data))
r.raise_for_status()  # stop on an HTTP error instead of trying to parse an error page
j = r.json()

hits = j['hits']['hits']
print(f"{len(hits)} recipes found\n")

for hit in hits:
    print(f"{hit['_source']['title']} - {', '.join(hit['_source']['book_authors'])}")

This produces output that begins:

273 recipes found

Senegalese Peanut Soup - James Peterson
Tomatillo and Sorrel Soup with Hominy - James Peterson
French Pork and Cabbage Soup - James Peterson
Swiss Chard, Parsley, and Garlic Soup - James Peterson
Foie Gras and Truffle Soup - James Peterson
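
The query above asks for everything in one call ("size": 500, "from": 0), which is enough for the 273 recipes here. If a search ever matched more than one call returns, the same two fields could be stepped to fetch the results in pages. The following is only a sketch of that idea: paged_queries is a made-up helper, the base dict is a heavily trimmed stand-in for the real query body, and I have not checked whether this endpoint accepts other "from" values.

```python
import copy

def paged_queries(base_query, total, page_size=100):
    """Yield copies of base_query with "from"/"size" set for each page."""
    for offset in range(0, total, page_size):
        page = copy.deepcopy(base_query)  # leave the caller's dict untouched
        page["from"] = offset
        page["size"] = min(page_size, total - offset)
        yield page

# Hypothetical, trimmed query body used only to show the paging fields.
base = {"query": {"match_all": {}}, "size": 500, "from": 0}
pages = list(paged_queries(base, total=273, page_size=100))
offsets = [(p["from"], p["size"]) for p in pages]
print(offsets)  # [(0, 100), (100, 100), (200, 73)]
```

Each yielded dict would then be sent with requests.post(url, json=page) in the same way as the single request above.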

The format of this POST request was found by looking at the requests in my browser's network tools.
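
One caveat about the "xxxxx" substitution: doing str.replace on the raw JSON text works for "Splendid Soups", but a title containing a double quote or a backslash would produce invalid JSON. A more robust variant, sketched below, parses the template first and swaps the placeholder inside the parsed structure. fill_placeholder is a hypothetical helper, the template is a trimmed fragment of the captured body, and the title with quotes is invented for the example.

```python
import json

def fill_placeholder(node, placeholder, value):
    """Return a copy of parsed JSON with every placeholder string replaced."""
    if isinstance(node, dict):
        return {k: fill_placeholder(v, placeholder, value) for k, v in node.items()}
    if isinstance(node, list):
        return [fill_placeholder(v, placeholder, value) for v in node]
    # Only exact string matches are replaced; booleans and numbers pass through.
    return value if node == placeholder else node

# Trimmed fragment of the captured request body, for illustration only.
template = json.loads(
    '{"post_filter": {"bool": {"must": ['
    '{"term": {"is_licensed": true}},'
    '{"term": {"book_full_title.raw": "xxxxx"}}]}}}'
)

# Invented title containing quotes, which would break a plain str.replace.
body = fill_placeholder(template, "xxxxx", 'The "Best" Soups')
print(json.dumps(body["post_filter"]["bool"]["must"][1]))
```

The resulting dict can be passed straight to requests.post(url, json=body), so the json.loads step on the edited string is no longer needed.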

The browser itself makes this call and then formats the returned JSON into HTML.
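
Since you are now working with the JSON rather than the rendered HTML, it can help to wrap the field access in a small function that tolerates missing keys. This is a sketch only: summarise_hits is a made-up name, and the sample response is hand-written to mirror the j['hits']['hits'] / '_source' layout used above (the second hit, without authors, is invented).

```python
def summarise_hits(response):
    """Return one 'title - authors' line per hit, skipping missing fields."""
    hits = response.get("hits", {}).get("hits", [])
    lines = []
    for hit in hits:
        source = hit.get("_source", {})
        title = source.get("title", "(untitled)")
        authors = ", ".join(source.get("book_authors") or [])
        lines.append(f"{title} - {authors}" if authors else title)
    return lines

# Hand-written sample shaped like the API response; not real API output.
sample = {
    "hits": {
        "hits": [
            {"_source": {"title": "Senegalese Peanut Soup",
                         "book_authors": ["James Peterson"]}},
            {"_source": {"title": "Example Soup"}},  # invented hit, no authors
        ]
    }
}

summary = summarise_hits(sample)
for line in summary:
    print(line)
```

With the real response you would call summarise_hits(j) instead of indexing the nested keys directly.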

python

web-scraping

python-requests-html