Is there a more efficient way of accessing a JavaScript Table without Selenium?

Question

I am currently working on a side project to scrape the results of a web form that returns a table rendered with JavaScript.

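I have managed to do this fairly easily with Selenium. However, I query this form approximately 5,000 times based on a CSV file, which leads to a long processing time (approximately 9 hours).

I would like to know whether there is a way to access the response data directly from Python using the generated request URL, instead of rendering the JavaScript.

The website form in question: https://probatesearch.service.gov.uk/

An example of the network request URL captured after completing both parts of the form (entering a year before 1996 produces different responses, which can be ignored):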

https://probatesearch.service.gov.uk/api/nuxeo/api/v1/search/pp/pp_mainstream_default_search/execute?pageProvider=pp_mainstream_default_search&currentPageIndex=0&hmcts_grant_schema_grantdocTypeOf=1&hmcts_grant_schema_surname=SMITH&hmcts_grant_schema_dateofdeath_min=2019-03-23T00%3A00%3A00.000Z&hmcts_grant_schema_dateofdeath_max=2019-03-23T00%3A00%3A00.000Z&hmcts_grant_schema_dateofprobate_min=2022-02-01T00%3A00%3A00.000Z&hmcts_grant_schema_dateofprobate_max=2022-03-02T00%3A00%3A00.000Z&hmcts_grant_schema_firstnames=TREVOR&sortBy=&sortOrder=DESC

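I have tried handling this request with BeautifulSoup, urllib and requests, but have not managed to extract any meaningful data. I am fairly new to web scraping.

The output I get using urllib or requests is as follows: JSON Response

Unfortunately, this does not include any of the actual data from the requested table (e.g. name, date of death, etc.).

I would like to capture the table response (if any) into JSON or a DataFrame for further processing. Any help is appreciated.

Edit: Here is a screenshot of the data I am trying to access once the form has been completed and the table requested: Required Table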

Answer 1

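The general answer is that the UK government (or maybe just the court system) seems to be in the process of implementing APIs to access the type of data you are looking for - you should definitely read up on APIs in general.

More specifically, in your case the data can be obtained through an API call, which you can see in the developer tools of your browser (the network tab). See more here, for one of many examples.

So in this case, I assume you know some (but not all) of the information about the case (in the example below, you know the last name, the year of death and the year of probate) and send an API request containing that information. The call retrieves 7 entries.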

import requests
import json

url = 'https://probatesearch.service.gov.uk/api/nuxeo/api/v1/search/pp/pp_mainstream_default_search/execute'

last_name, death, probate = 'SMITH',2019,2022
targets = ['hmctsgrant:surname','hmctsgrant:firstnames','hmctsgrant:dateofdeath','hmctsgrant:dateofprobate','hmctsgrant:probatenumber',
    'hmctsgrant:grantdocTypeoOfName','hmctsgrant:registryofficename']

param_dict = (
    ('pageProvider', 'pp_mainstream_default_search'),
    ('currentPageIndex', '0'),
    ('hmcts_grant_schema_grantdocTypeOf', '1'),
    ('hmcts_grant_schema_surname', f'{last_name}'),
    ('hmcts_grant_schema_dateofdeath_min', f'{death}-01-01T00:00:00.000Z'),
    ('hmcts_grant_schema_dateofdeath_max',f'{death}-12-31T00:00:00.000Z'),
    ('hmcts_grant_schema_dateofprobate_min', f'{probate}-01-01T00:00:00.000Z'),
    ('hmcts_grant_schema_dateofprobate_max', f'{probate}-12-31T00:00:00.000Z'),
    ('hmcts_grant_schema_firstnames', ''),
    ('sortBy', ''),
    ('sortOrder', 'DESC'),
)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:97.0) Gecko/20100101 Firefox/97.0',
    'Accept': 'application/json',
    'Referer': 'https://probatesearch.service.gov.uk/search-results',
    'properties': 'hmcts_grant_schema',

}
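# Note: the original call also passed cookies=cookies, but no cookies variable was defined,
# which raises a NameError, so it is dropped here. Add cookies only if the site proves to need them.
response = requests.get(url, headers=headers, params=param_dict)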

data = json.loads(response.text)
for entry in data['entries']:
    info = entry['properties']        
    for target in targets:
        print(info[target])
    print('------------')

The output in this case is

Smith
Trevor Floyd
2019-03-23T00:00:00.000Z
2022-02-03T00:00:00.000Z
1641476859693801
ADMINISTRATION
Newcastle
------------
Smith
David William
2019-02-06T00:00:00.000Z
2022-02-04T00:00:00.000Z
1643363130442596
ADMINISTRATION
Newcastle
------------

and so on.
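Since the question involves roughly 5,000 lookups driven by a CSV file, the single call above could be parameterised per row. The following is only a sketch of that idea: the helper names build_params and read_queries and the CSV column names (surname, death_year, probate_year) are hypothetical, and the parameter names are simply copied from the request shown earlier.

```python
import csv
import io


def build_params(surname, death_year, probate_year, first_names=""):
    """Return the query parameters for one search, mirroring param_dict above."""
    return (
        ("pageProvider", "pp_mainstream_default_search"),
        ("currentPageIndex", "0"),
        ("hmcts_grant_schema_grantdocTypeOf", "1"),
        ("hmcts_grant_schema_surname", str(surname)),
        ("hmcts_grant_schema_dateofdeath_min", f"{death_year}-01-01T00:00:00.000Z"),
        ("hmcts_grant_schema_dateofdeath_max", f"{death_year}-12-31T00:00:00.000Z"),
        ("hmcts_grant_schema_dateofprobate_min", f"{probate_year}-01-01T00:00:00.000Z"),
        ("hmcts_grant_schema_dateofprobate_max", f"{probate_year}-12-31T00:00:00.000Z"),
        ("hmcts_grant_schema_firstnames", first_names),
        ("sortBy", ""),
        ("sortOrder", "DESC"),
    )


def read_queries(csv_file):
    """Yield one parameter tuple per CSV row.

    The column names used here are assumptions, not the asker's real layout.
    """
    for row in csv.DictReader(csv_file):
        yield build_params(
            row["surname"], int(row["death_year"]), int(row["probate_year"])
        )


# Small in-memory stand-in for the real CSV file.
sample_csv = io.StringIO(
    "surname,death_year,probate_year\nSMITH,2019,2022\nJONES,2018,2020\n"
)
queries = list(read_queries(sample_csv))
print(len(queries))
print(dict(queries[0])["hmcts_grant_schema_dateofdeath_min"])
```

Each tuple in queries can then be passed as params= to the requests.get call shown earlier. Reusing one requests.Session for all calls and pausing briefly between them would probably be kinder to the server, though how much load the service tolerates is not something the source states.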

You can obviously load the output into a pandas DataFrame, or whatever else you need to use.
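As a minimal sketch of that last step, the entries could first be flattened into a list of plain dictionaries, which is easy to dump as JSON. The helper name entries_to_rows is hypothetical, and the sample response below is hand-built from the first record of the output above rather than a real API response; it assumes the response keeps the entries/properties layout used in the loop earlier.

```python
import json

# Field names copied from the targets list in the answer's snippet.
TARGETS = [
    "hmctsgrant:surname",
    "hmctsgrant:firstnames",
    "hmctsgrant:dateofdeath",
    "hmctsgrant:dateofprobate",
    "hmctsgrant:probatenumber",
    "hmctsgrant:grantdocTypeoOfName",
    "hmctsgrant:registryofficename",
]


def entries_to_rows(data, targets=TARGETS):
    """Flatten the 'entries' of one response into a list of dicts.

    Column names drop the 'hmctsgrant:' prefix; missing fields become None.
    """
    rows = []
    for entry in data.get("entries", []):
        info = entry.get("properties", {})
        rows.append({t.split(":", 1)[1]: info.get(t) for t in targets})
    return rows


# Hand-built stand-in for a parsed response, using the first record printed above.
sample = {
    "entries": [
        {
            "properties": {
                "hmctsgrant:surname": "Smith",
                "hmctsgrant:firstnames": "Trevor Floyd",
                "hmctsgrant:dateofdeath": "2019-03-23T00:00:00.000Z",
                "hmctsgrant:dateofprobate": "2022-02-03T00:00:00.000Z",
                "hmctsgrant:probatenumber": "1641476859693801",
                "hmctsgrant:grantdocTypeoOfName": "ADMINISTRATION",
                "hmctsgrant:registryofficename": "Newcastle",
            }
        }
    ]
}

rows = entries_to_rows(sample)
print(json.dumps(rows, indent=2))
```

If pandas is installed, pandas.DataFrame(rows) should turn the same list of dictionaries into a DataFrame with one column per field.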


python

web-scraping