How do I scrape all the data from a table that is revealed by clicking a button?

Question

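I am trying to scrape the website https://gmatclub.com/forum/decision-tracker.html. I need the decision tracker table, which updates in real time. The code below gives me only the data present on the current page.

When you scroll down, a 'show more' button appears that reveals older entries. How can I get all the data from the table (all 5500+ entries)?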

import requests
import pandas as pd

with requests.Session() as connection:
    connection.headers.update(
        {
            "referer": "https://gmatclub.com/forum/decision-tracker.html",
            "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.86 YaBrowser/21.3.0.740 Yowser/2.5 Safari/537.36",
        }
    )
    _ = connection.get("https://gmatclub.com/forum/decision-tracker.html")
    endpoint = connection.get("https://gmatclub.com/api/schools/v1/forum/app-tracker-latest-updates?limit=50&year=all").json()
    for item in endpoint["statistics"]:
        print(item)

#df = pd.DataFrame(endpoint["statistics"])
#print(df.head())
#df.to_csv("your_table_data.csv", index=False)

Answer 1

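A quick and easy way to solve your problem is to set a higher limit in params and page through the results with offset until the API returns an empty list. I only print the id here to show you that it works. You can stick with your DataFrame approach.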

import requests

link = 'https://gmatclub.com/api/schools/v1/forum/app-tracker-latest-updates'
params = {
    'limit': 500,
    'offset': 0,
    'year': 'all'
}

with requests.Session() as con:
    con.headers["User-Agent"] = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.86 YaBrowser/21.3.0.740 Yowser/2.5 Safari/537.36"
    con.get("https://gmatclub.com/forum/decision-tracker.html")
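    while True:
        endpoint = con.get(link, params=params).json()
        # Stop once the API returns an empty page.
        if not endpoint["statistics"]:
            break
        for item in endpoint["statistics"]:
            print(item['id'])

        # Advance by the full page size; stepping by 499 would overlap each page by one row.
        params['offset'] += params['limit']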
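
If you want the rows collected rather than printed, the same offset loop can be wrapped in a small helper. This is only a sketch: `make_fetcher` and `collect_all` are hypothetical names of my own, and it assumes the endpoint keeps accepting `limit`, `offset` and `year` and returns rows under `"statistics"` with an `id` field, as in the code above. Because the table updates live, rows can shift between requests, so the helper drops repeated ids.

```python
from typing import Callable, Dict, List


def make_fetcher(session, url: str, year: str = "all") -> Callable[[int, int], List[Dict]]:
    """Wrap a requests.Session-like object into fetch_page(offset, limit)."""
    def fetch_page(offset: int, limit: int) -> List[Dict]:
        params = {"limit": limit, "offset": offset, "year": year}
        response = session.get(url, params=params)
        response.raise_for_status()
        # Assumption: rows live under the "statistics" key, as in the question.
        return response.json()["statistics"]
    return fetch_page


def collect_all(fetch_page: Callable[[int, int], List[Dict]], limit: int = 500) -> List[Dict]:
    """Call fetch_page until it returns an empty page; keep each id once."""
    rows: List[Dict] = []
    seen_ids = set()
    offset = 0
    while True:
        page = fetch_page(offset, limit)
        if not page:
            break
        for row in page:
            if row["id"] not in seen_ids:
                seen_ids.add(row["id"])
                rows.append(row)
        # Advance by what was actually returned, in case the server caps the limit.
        offset += len(page)
    return rows
```

Inside the `with requests.Session() as con:` block you could then call `rows = collect_all(make_fetcher(con, link))` and pass the result to `pd.DataFrame(rows).to_csv("your_table_data.csv", index=False)`, which matches the DataFrame lines commented out in the question.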


python

xmlhttprequest

web-scraping

python-requests