Scrapy request not going through

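I'm not sure how to phrase this question precisely. I'm a beginner at web scraping, and I'm trying to scrape a website with Python Scrapy. The site is dynamic and rendered with JavaScript, so basic XPath and CSS selectors don't retrieve any data.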

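I tried to mimic the site's API request from my spider by requesting the URL that returns the data as a JSON object. That request fails with an "HTTP status code is not handled or not allowed" error. I think I'm calling the wrong URL. Calling the JSON URL directly has worked for me 9 times out of 10, so what can I do differently here? In the browser's headers section the request shows parameters and form data items, and the URL doesn't even look like a normal website URL: it starts with https://ih3kc909gb-dsn.algolia.net/1/indexes... I know this is a long question, but I'd really appreciate some help with it.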

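You should use the start_requests() method instead of the start_urls attribute. URLs listed in start_urls are fetched with GET requests, while this endpoint expects a POST request carrying the query in its body. You can read more about start_requests() in the Scrapy documentation. Once you have that method, all you need to do is issue the POST request with the same headers and body the browser sends.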

Code

import scrapy

class carswitch(scrapy.Spider):
    name = 'car'

    headers = {
        "Connection": "keep-alive",
        "Pragma": "no-cache",
        "Cache-Control": "no-cache",
        "sec-ch-ua": "\" Not;A Brand\";v=\"99\", \"Google Chrome\";v=\"91\", \"Chromium\";v=\"91\"",
        "accept": "application/json",
        "sec-ch-ua-mobile": "?0",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36",
        "content-type": "application/x-www-form-urlencoded",
        "Origin": "https://carswitch.com",
        "Sec-Fetch-Site": "cross-site",
        "Sec-Fetch-Mode": "cors",
        "Sec-Fetch-Dest": "empty",
        "Referer": "https://carswitch.com/",
        "Accept-Language": "en-US,en;q=0.9"
    }

    body = '{"params":"query=&hitsPerPage=24&page=0&numericFilters=%5B%22country_id%3D1%22%2C%22used_car%20%3D%201%22%5D&facetFilters=&typoTolerance=&tagFilters=%5B%5D&attributesToHighlight=%5B%5D&attributesToRetrieve=%5B%22make%22%2C%22make_ar%22%2C%22model%22%2C%22model_ar%22%2C%22year%22%2C%22trim%22%2C%22displayTrim%22%2C%22colorPaint%22%2C%22bodyType%22%2C%22salePrice%22%2C%22transmissionType%22%2C%22GPS%22%2C%22carID%22%2C%22inspectionID%22%2C%22inspectionStatus%22%2C%22rate%22%2C%22certified_dealer_id%22%2C%22dealer_category%22%2C%22used_car%22%2C%22new%22%2C%22top_condition%22%2C%22featured%22%2C%22photo%22%2C%22modifiedPlace%22%2C%22city%22%2C%22mileage%22%2C%22urgent_sales%22%2C%22price_dropped%22%2C%22urgent_sales_days%22%2C%22urgent_sales_end_date%22%2C%22date%22%2C%22negotiable%22%2C%22oldPrice%22%2C%22zero_downpayment%22%2C%22cashOnly%22%2C%22hasPriceGuidance%22%2C%22dealerOffer%22%2C%22maxPrice%22%2C%22fairPrice%22%2C%22pricey_deal%22%2C%22fair_deal%22%2C%22good_deal%22%2C%22great_deal%22%2C%22dealership_info%22%2C%22logo_small%22%2C%22GCCspecs%22%2C%22country%22%2C%22export%22%2C%22monthly_price%22%5D"}'

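    def start_requests(self):
        # The Algolia query endpoint; the application ID and API key travel as query parameters.
        url = 'https://ih3kc909gb-dsn.algolia.net/1/indexes/All_Carswitch_Cars/query?x-algolia-agent=Algolia%20for%20JavaScript%20(3.33.0)%3B%20Browser&x-algolia-application-id=IH3KC909GB&x-algolia-api-key=493a9bbc57331df3b278fa39c1dd8f2d'

        # Only the scrapy module is imported, so Request must be referenced as scrapy.Request.
        yield scrapy.Request(url=url, method='POST', headers=self.headers, body=self.body, callback=self.parse)

    def parse(self, response):
        # response.body holds the raw JSON bytes returned by the API.
        print(response.body)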
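
If you want more than the first page, it may help to build the body from a dictionary instead of a hand-encoded string, and to decode the JSON rather than print the raw bytes. The following is only a sketch under some assumptions: build_body and extract_hits are hypothetical helper names, only a few of the parameters from the body above are reproduced, and the "hits", "page" and "nbPages" keys are what I would expect an Algolia response to contain, so check them against the actual response before relying on them.

```python
import json
from urllib.parse import quote, urlencode


def build_body(page, hits_per_page=24):
    """Return the JSON request body for one results page (hypothetical helper)."""
    params = {
        "query": "",
        "hitsPerPage": hits_per_page,
        "page": page,
        # Filter values are JSON arrays themselves, as in the hand-written body.
        "numericFilters": json.dumps(
            ["country_id=1", "used_car = 1"], separators=(",", ":")
        ),
    }
    # quote_via=quote encodes spaces as %20, matching the original string.
    return json.dumps({"params": urlencode(params, quote_via=quote)})


def extract_hits(raw_body):
    """Decode a response body and return (hits, page, nb_pages).

    The key names are assumptions about the response shape; missing keys
    fall back to empty or zero values instead of raising.
    """
    data = json.loads(raw_body)
    return data.get("hits", []), data.get("page", 0), data.get("nbPages", 0)
```

Inside parse() you could then call extract_hits(response.body), yield each hit as an item, and, while page + 1 is less than nb_pages, yield another scrapy.Request with body=build_body(page + 1) and the same URL and headers.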