How can I scrape the JSON file off this website?

Question

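I can't find a solution to the problem I'm running into.

I want to scrape the JSON file from https://www.armadarealestate.com/Inventory.aspx

When I open the browser's Network tab and select the URL that is loading the JSON, I am just sent to another HTML page, even though the Response section says it contains the information about the properties I need.

So how can I pull the JSON file off the website?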


import json
import requests

resp = requests.get(url='https://buildout.com/plugins/3e0f3893dc334368bb1ee6274ad5fd7b546414e9/inventory?utf8=%E2%9C%93&page=-3&brandingId=&searchText=&q%5Bsale_or_lease_eq%5D=&q%5Bs%5D%5B%5D=&viewType=list&q%5Btype_eq_any%5D%5B%5D=2&q%5Btype_eq_any%5D%5B%5D=5&q%5Btype_eq_any%5D%5B%5D=1&q%5Bcity_eq%5D=')

print(json.loads(resp.text))

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

In fact, when I fetch the request that is supposed to return the JSON file, the response I get back from 'https://buildout.com/plugins/3e0f3893dc334368bb1ee6274ad5fd7b546414e9/inventory?utf8=%E2%9C%93&page=0&brandingId=&searchText=&q%5Bsale_or_lease_eq%5D=&q%5Bs%5D%5B%5D=&viewType=list&q%5Btype_eq_any%5D%5B%5D=2&q%5Btype_eq_any%5D%5B%5D=5&q%5Btype_eq_any%5D%5B%5D=1&q%5Bcity_eq%5D=' is an HTML file instead.

How can I fix this?

Answer 1

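The body of your response object "resp" is not valid JSON. It is just HTML content, which is why json.loads fails at the very first character. You can use BeautifulSoup to scrape content out of the HTML.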
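A quick way to confirm this diagnosis is to look at the body and the Content-Type before decoding. The helper below is only a sketch: the name parse_json_body and its return shape are made up for illustration, and it uses nothing beyond the standard library.

```python
import json


def parse_json_body(text, content_type=""):
    """Return (data, None) if text parses as JSON, else (None, reason)."""
    # A JSON response normally carries a media type such as application/json.
    is_json_type = "json" in content_type.lower()
    try:
        return json.loads(text), None
    except json.JSONDecodeError as exc:
        # An HTML page starts with "<", which is what triggers "Expecting value".
        kind = "HTML" if text.lstrip().startswith("<") else "non-JSON"
        hint = "" if is_json_type else f" (Content-Type was {content_type!r})"
        return None, f"{kind} body, not JSON{hint}: {exc.msg}"
```

With requests, the two arguments would be resp.text and resp.headers.get("Content-Type", "").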

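The reason you are not getting a JSON object is the JavaScript in the HTML. Python's requests library downloads only the HTML document itself and does not execute scripts; if you want the JavaScript to be rendered, use a library such as Selenium.

Otherwise, find the URL that loads the JSON via AJAX and fetch the JSON from it with requests.

In your case, here is test code that fetches the JSON: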

import requests
url = "https://buildout.com/plugins/3e0f3893dc334368bb1ee6274ad5fd7b546414e9/inventory?utf8=%E2%9C%93&page=0&brandingId=&searchText=&q%5Bsale_or_lease_eq%5D=&q%5Bs%5D%5B%5D=&viewType=list&q%5Btype_eq_any%5D%5B%5D=2&q%5Btype_eq_any%5D%5B%5D=5&q%5Btype_eq_any%5D%5B%5D=1&q%5Bcity_eq%5D="
h = {'accept': 'application/json', 'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36'}  
r = requests.get(url, headers=h) 
print(r.json())

#prints the JSON data
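The query string in that URL is percent-encoded, which makes it hard to see which filters are being sent. The sketch below decodes it with the standard library and rebuilds it with a different page value. The helper names are hypothetical, and whether the server accepts other page values is an assumption.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

URL = "https://buildout.com/plugins/3e0f3893dc334368bb1ee6274ad5fd7b546414e9/inventory?utf8=%E2%9C%93&page=0&brandingId=&searchText=&q%5Bsale_or_lease_eq%5D=&q%5Bs%5D%5B%5D=&viewType=list&q%5Btype_eq_any%5D%5B%5D=2&q%5Btype_eq_any%5D%5B%5D=5&q%5Btype_eq_any%5D%5B%5D=1&q%5Bcity_eq%5D="


def decode_query(url):
    """Return the query as (name, value) pairs, keeping blank and repeated keys."""
    return parse_qsl(urlsplit(url).query, keep_blank_values=True)


def with_page(pairs, page):
    """Return a copy of pairs with every 'page' entry set to the given number."""
    return [(k, str(page)) if k == "page" else (k, v) for k, v in pairs]


def rebuild(url, pairs):
    """Re-attach the (possibly modified) pairs to the URL's base path."""
    base = url.split("?", 1)[0]
    return f"{base}?{urlencode(pairs)}"


pairs = decode_query(URL)
for name, value in pairs:
    print(f"{name!r}: {value!r}")  # e.g. 'q[type_eq_any][]': '2'
```

The same list of pairs can also be passed to requests.get(base_url, params=pairs, headers=h), which lets requests do the encoding.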

python

json

scrapy

web-scraping

python-requests