
Python Requests Post within a nested Json - retrieve data with a specific value

I have searched Stack Overflow but could not find an answer to my question. I am accessing an API from the German government whose output is limited to 10,000 entries. I want all the data for a specific city, and since the source database holds more than 10,000 entries, I need to "query within" the requests.post call.

Here is one entry of the JSON result when I simply do a requests.post against the API:

{
    "results":[
        {
            "_id":"CXPTYYFY807",
            "CREATED_AT":"2019-12-17T14:48:17.130Z",
            "UPDATED_AT":"2019-12-17T14:48:17.130Z",
            "result":{
                "id":"CXPTYYFY807",
                "title":"Bundesstadt Bonn, SGB-315114, Ortsteilzentrum Brüser Berg, Fliesenarbeiten",
                "description":["SGB-315114","Ortsteilzentrum Brüser Berg, Fliesenarbeiten"],
                "procedure_type":"Ex ante Veröffentlichung (§ 19 Abs. 5)",
                "order_type":"VOB",
                "publication_date":"",
                "cpv_codes":["45431000-7","45431100-8"],
                "buyer":{
                    "name":"Bundesstadt Bonn, Referat Vergabedienste",
                    "address":"Berliner Platz 2",
                    "town":"Bonn",
                    "postal_code":"53111"},
                "seller":{
                    "name":"",
                    "town":"",
                    "country":""
                },
                "geo":{
                    "lon":7.0944,
                    "lat":50.73657
                },
                "value":"",
                "CREATED_AT":"2019-12-17T14:48:17.130Z",
                "UPDATED_AT":"2019-12-17T14:48:17.130Z"}
            
        }
        ],
    "aggregations":{},
    "pagination":{
        "total":47389,
        "start":0,
        "end":0 }}
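For what it's worth, with a response shaped like the snippet above, one workaround is to filter client-side after fetching. The dictionary below is a trimmed, hypothetical stand-in for the real response:

```python
# Trimmed, hypothetical response shaped like the snippet above.
response_json = {
    "results": [
        {"_id": "CXPTYYFY807", "result": {"buyer": {"town": "Bonn"}}},
        {"_id": "ZZZEXAMPLE1", "result": {"buyer": {"town": "Köln"}}},
    ],
    "pagination": {"total": 2, "start": 0, "end": 0},
}

# Keep only the entries whose buyer sits in Bonn; .get() guards
# against entries where "result" or "buyer" is missing.
bonn = [
    entry for entry in response_json["results"]
    if entry.get("result", {}).get("buyer", {}).get("town") == "Bonn"
]
print([entry["_id"] for entry in bonn])  # → ['CXPTYYFY807']
```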

What I want is all the data where the buyer is from "town": "Bonn".

What I have already tried:

import requests 

url = 'https://daten.vergabe.nrw.de/rest/evergabe/aggregation_search'
headers = {'Accept': 'application/json', 'Content-Type': 'application/json'}
data = {"results": [{"result": {"buyer": {"town":"Bonn"}}}]}

# need to set the size limit, otherwise the API returns fewer entries:
params = {'size': 10000}
 
req = requests.post(url, params=params, headers=headers, json=data)

This returns results from the POST, but they are not filtered by city. I also tried req = requests.post(url, params=params, headers=headers, data=data), which returns error 400.

An alternative would be to fetch all the data in a loop using the pagination parameters at the end of the JSON, but again I cannot work out the JSON path to pagination, e.g. start: 0, end: 500.

Can anyone help me with this?

Try:

url = 'https://daten.vergabe.nrw.de/rest/evergabe/aggregation_search'
 
headers = {'Accept': 'application/json', 'Content-Type': 'application/json'}

query1 = {
    "query": {
        "match": {
            "buyer.town": "Bonn"
        }
    }
}

req = requests.post(url, headers=headers, json=query1)

# Check the output
req.text
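If the endpoint accepts this Elasticsearch-style body, the size limit from the question can presumably go into the same body instead of the URL parameters — hedged, since the endpoint's schema is not documented here:

```python
import json

# Hypothetical combination of the match filter with a result size,
# following Elasticsearch request-body conventions; whether this
# endpoint honours the "size" field is an assumption.
query1 = {
    "query": {"match": {"buyer.town": "Bonn"}},
    "size": 10000,
}
print(json.dumps(query1))
```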

Edit: This won't work if the filter matches more than 10,000 results, but it may be a quick workaround for the problem you are facing.

import json
import requests
import math

url = "https://daten.vergabe.nrw.de/rest/vmp_rheinland"

size = 5000


payload = '{"sort":[{"_id":"asc"}],"query":{"match_all":{}},"size":'+str(size)+'}'
headers = {
    'accept': "application/json",
    'content-type': "application/json",
    'cache-control': "no-cache"
    }

response = requests.request("POST", url, data=payload, headers=headers)

tenders_array = []

query_data = json.loads(response.text)

tenders_array.extend(query_data['results'])

total_hits = query_data['pagination']['total']
result_size = len(query_data['results'])
last_id = query_data['results'][-1]["_id"]

number_of_loops = ((total_hits - size) // size )
last_loop_size = ((total_hits - size) % size)


for i in range(number_of_loops+1):
    if i == number_of_loops:
        size=last_loop_size
    payload = '{"sort":[{"_id":"asc"}],"query":{"match_all":{}},"size":'+str(size)+',"search_after":["'+last_id+'"]}'
    response = requests.request("POST", url, data=payload, headers=headers)
    query_data = json.loads(response.text)
    result_size = len(query_data['results'])
    if result_size > 0:
        tenders_array.extend(query_data['results'])
        last_id = query_data['results'][-1]["_id"] 
    else:
        break
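
As a side note, the string-concatenated payload above can equally be built with json.dumps, which handles quoting and escaping automatically (last_id here is a hypothetical value):

```python
import json

size = 5000
last_id = "CXPTYYFY807"  # hypothetical _id of the last result on the previous page

# Same search_after payload as the concatenated string above,
# built with json.dumps instead.
payload = json.dumps({
    "sort": [{"_id": "asc"}],
    "query": {"match_all": {}},
    "size": size,
    "search_after": [last_id],
})
print(payload)
```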

The code is also in this gist: https://gist.github.com/thiagoalencar/34401e204358499ea3b9aa043a18395f

This is some code for paginating through the elasticsearch API. It is an API on top of the elasticsearch API, and the documentation is not very clear. I tried scroll without success. This solution uses the search_after parameter without a point in time, since that endpoint is not available. Sometimes the server refuses the request, so you need to check with response.status_code == 502.
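A minimal sketch of that 502 check, with the HTTP call injected as a callable (so requests.post can be passed in); the helper name and retry policy are illustrative, not part of the original answer:

```python
import time

def post_with_retry(post, url, payload, headers, retries=5, wait=1.0):
    # `post` is any callable with the signature of requests.post.
    # Retry only on 502 Bad Gateway, waiting a little longer each time.
    response = None
    for attempt in range(retries):
        response = post(url, data=payload, headers=headers)
        if response.status_code != 502:
            return response
        time.sleep(wait * attempt)  # simple linear backoff
    return response
```

Called as `post_with_retry(requests.post, url, payload, headers)`, it behaves like the plain request on success and only loops while the server answers 502.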

The code is messy and needs refactoring, but it works. The final tenders_array contains all the objects.