Unable to scrape all the results from search page
I am trying to retrieve all the results from the following website with a request like this:
import scrapy
import requests


class MyPropertySpider(scrapy.Spider):
    name = 'my_property'
    start_urls = [
        'https://www.myproperty.co.za/search?last=1y&coords%5Blat%5D=-33.2277918&coords%5Blng%5D=21.8568586&coords%5Bnw%5D%5Blat%5D=-30.4302599&coords%5Bnw%5D%5Blng%5D=17.7575637&coords%5Bse%5D%5Blat%5D=-47.1313489&coords%5Bse%5D%5Blng%5D=38.2216904&description=Western%20Cape%2C%20South%20Africa&status=For%20Sale',
    ]

    def parse(self, response):
        headers = {
            'authority': 'jf6e1ij07f.execute-api.eu-west-1.amazonaws.com',
            'pragma': 'no-cache',
            'cache-control': 'no-cache',
            'accept': 'application/json, text/plain, */*',
            'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Mobile Safari/537.36',
            'content-type': 'application/json;charset=UTF-8',
            'origin': 'https://www.myproperty.co.za',
            'sec-fetch-site': 'cross-site',
            'sec-fetch-mode': 'cors',
            'sec-fetch-dest': 'empty',
            'referer': 'https://www.myproperty.co.za/',
            'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
        }
        data = '{"clientOfficeId":[],"countryCode":"za","sortField":"distance","sortOrder":"asc","last":"0.5y","statuses":["For Sale","Pending Sale","Under Offer","Final Sale","Auction"],"coords":{"lat":"-33.9248685","lng":"18.4240553","nw":{"lat":"-33.47127","lng":"18.3074488"},"se":{"lat":"-34.3598061","lng":"19.00467"}},"radius":2500,"nearbySuburbs":true,"limit":210,"start":0}'
        # POST the search query to the site's backend API.
        response = requests.post(
            'https://jf6e1ij07f.execute-api.eu-west-1.amazonaws.com/p/search',
            headers=headers, data=data)
However, I only get 200 results from that page, even though the search page in question has over 1,000 results. I can see that the limit in the request data is 210, and when I try to increase it, nothing changes. I am not sure how to get around this (or whether it is even possible?).
Any suggestions?
Thanks in advance!
Since you are using scrapy, I suggest using FormRequest instead of the requests library. You can perform the same POST request with either one. Here are the docs if you want to read up on this method.
This is the form data you are passing; it gives the server all of the search parameters you are interested in:
data = {
    "clientOfficeId": [],
    "countryCode": "za",
    "sortField": "distance",
    "sortOrder": "asc",
    "last": "0.5y",
    "statuses": ["For Sale", "Pending Sale", "Under Offer", "Final Sale", "Auction"],
    "coords": {
        "lat": "-33.9248685",
        "lng": "18.4240553",
        "nw": {"lat": "-33.47127", "lng": "18.3074488"},
        "se": {"lat": "-34.3598061", "lng": "19.00467"},
    },
    "radius": 2500,
    "nearbySuburbs": True,
    "limit": 210,
    "start": 0,
}
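One detail worth flagging: the question's headers set 'content-type: application/json', so this endpoint expects a JSON body rather than URL-encoded form fields. FormRequest sends the latter; its sibling scrapy.http.JsonRequest serializes a dict to a JSON body for you. A minimal sketch, assuming data is the dict above and parse_results is your (hypothetical) callback, yielded from inside a spider method:

from scrapy.http import JsonRequest

# `data` is the dict shown above. JsonRequest sets method=POST,
# serializes `data` into the request body as JSON, and adds the
# Content-Type: application/json header for you.
yield JsonRequest(
    'https://jf6e1ij07f.execute-api.eu-west-1.amazonaws.com/p/search',
    data=data,
    callback=self.parse_results,
)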
Since the server is not willing to give you all the data at once (I have not tested it, but you say that increasing the limit did not change the results), it expects you to "paginate" through the data, just as you would on the website itself.
When you send the form above, it returns 210 results, so on your next call you need to tell the server that you want the NEXT 210 results, different from the ones you have already received. For that you use the start field in the form: pass "start": 210 in your next request, and keep adding 210 until the server starts returning empty responses. (Usually the response is not completely empty; rather, the results field comes back empty.) A sketch of the whole loop follows.
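Here is a minimal sketch of that pagination loop, under stated assumptions: the payload fields come from the question, but the response field name 'results' and the method names page_request/parse_page are guesses made to illustrate the pattern; inspect a real response and adjust accordingly:

import scrapy
from scrapy.http import JsonRequest

SEARCH_URL = 'https://jf6e1ij07f.execute-api.eu-west-1.amazonaws.com/p/search'
PAGE_SIZE = 210

# The search filters from the dict above, minus the paging fields.
SEARCH_PARAMS = {
    "clientOfficeId": [],
    "countryCode": "za",
    "sortField": "distance",
    "sortOrder": "asc",
    "last": "0.5y",
    "statuses": ["For Sale", "Pending Sale", "Under Offer", "Final Sale", "Auction"],
    "coords": {
        "lat": "-33.9248685",
        "lng": "18.4240553",
        "nw": {"lat": "-33.47127", "lng": "18.3074488"},
        "se": {"lat": "-34.3598061", "lng": "19.00467"},
    },
    "radius": 2500,
    "nearbySuburbs": True,
}


class MyPropertySpider(scrapy.Spider):
    name = 'my_property'

    def start_requests(self):
        yield self.page_request(start=0)

    def page_request(self, start):
        # Each request asks for one "page" of PAGE_SIZE results,
        # offset by `start`.
        payload = {**SEARCH_PARAMS, "limit": PAGE_SIZE, "start": start}
        return JsonRequest(SEARCH_URL, data=payload,
                           callback=self.parse_page,
                           cb_kwargs={'start': start})

    def parse_page(self, response, start):
        # 'results' is an assumption about the response schema --
        # check what the API actually returns.
        results = response.json().get('results', [])
        if not results:
            return  # empty page: we have paginated past the last result
        yield from results
        # Ask for the NEXT page of results.
        yield self.page_request(start=start + PAGE_SIZE)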