Scrapy shell response differs from scrapy crawl response

Question

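I have recreated an XHR request. When I enter the XHR request URL in a browser window (a plain GET request), the first hit returns only partial JSON output. If I reload the page, it loads more data the next time, which seems strange. Can anyone help me solve this? Thanks in advance.

One more piece of information: when I tried the same request in the Scrapy shell, it also returned the full JSON response.

Code: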

import scrapy

class test(scrapy.Spider):
    name = "test"
    # allowed_domains must be a list of domain strings
    allowed_domains = ["ar.trivago.com"]

    def start_requests(self):
        yield scrapy.Request("http://ar.trivago.com/search/region?iPathId=38715&iGeoDistanceItem=47160&aDateRange%5Barr%5D=2015-11-13&aDateRange%5Bdep%5D=2015-11-14&iRoomType=7&tgs=4716002&aHotelTestClassifier=&aPriceRange%5Bfrom%5D=0&aPriceRange%5Bto%5D=0&iIncludeAll=0&iGeoDistanceLimit=20000&aPartner=&iViewType=0&bIsSeoPage=false&bIsSitemap=false&&_=1446825699501",
                         callback=self.parse)

    def parse(self, response):
        print("RESPONSE::", response.body)

Please help me solve this issue.

Answer 1

You are making the request with an already-encoded URL. Scrapy re-encodes it, and it looks like the target website does not support double encoding.
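To illustrate what double encoding looks like in general, here is a minimal sketch using only Python's standard `urllib.parse`. It does not reproduce Scrapy's internal URL handling; it only shows the effect on one parameter name taken from the question's URL. Encoding an already-encoded string turns `%5B` into `%255B`, which a server would no longer read as `[`.

```python
from urllib.parse import quote, unquote

# Parameter name as it appears (decoded) in the question's URL.
raw = "aDateRange[arr]"

encoded_once = quote(raw)            # brackets become %5B and %5D
encoded_twice = quote(encoded_once)  # the % sign itself becomes %25

print(encoded_once)
print(encoded_twice)

# Decoding once recovers the original name, so that a later
# encoding step is applied only a single time.
decoded = unquote(encoded_once)
print(decoded)
```

Passing the decoded form to the HTTP client, as the spider further down does, leaves any encoding to the client alone.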

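It is also worth mentioning that some websites with API endpoints have a protection that checks whether you already have a session. This is evidently meant to prevent direct requests to their endpoints. So in such cases it is always advisable to make a first "fake" request (which will create a session) before querying their API/endpoint.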

An example of the above is this answer on SO:

https://stackoverflow.com/a/33542753/4120036

Just check how it first makes a request to LOGIN_PAGE:
s.get(LOGIN_URL)
And then it makes the login post request:
login_response = s.post(LOGIN_URL, data=payload, headers={'Referer':'http://infotrac.galegroup.com/default/palm83799?db=SP19', 'Content-Type':'application/x-www-form-urlencoded'})
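The warm-up idea can also be tried without touching any real site. The sketch below is only an illustration under stated assumptions: `DemoHandler`, the `/search` path, and the `session=abc` cookie are all hypothetical and say nothing about how trivago actually behaves. A local server returns a short list to clients without a session cookie and a longer list once the cookie is sent back, and a cookie-aware opener plays the role of the browser.

```python
import json
import threading
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import HTTPCookieProcessor, build_opener


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical endpoint: full data only for clients that send a session cookie."""

    def do_GET(self):
        has_session = "session=abc" in (self.headers.get("Cookie") or "")
        payload = {"items": [1, 2, 3] if has_session else [1]}
        body = json.dumps(payload).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        if not has_session:
            # First contact: hand out a session cookie.
            self.send_header("Set-Cookie", "session=abc; Path=/")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch_twice():
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = "http://127.0.0.1:%d/search" % server.server_address[1]
    try:
        # The opener stores cookies between calls, like a browser or requests.Session.
        opener = build_opener(HTTPCookieProcessor(CookieJar()))
        with opener.open(url) as resp:
            first = json.loads(resp.read())   # warm-up request, creates the session
        with opener.open(url) as resp:
            second = json.loads(resp.read())  # same URL, now with the cookie
        return first, second
    finally:
        server.shutdown()
        server.server_close()


first, second = fetch_twice()
print(first, second)
```

The second call is the one whose response matters; the first exists only to obtain the cookie. Scrapy's cookie middleware keeps cookies in a similar way when requests are issued one after another.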

I decoded the site URL and added the X-Requested-With and Referer headers; it now returns the same amount of data as your browser:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http.request import Request

class test(scrapy.Spider):
    name = "test"
    allowed_domains = ["ar.trivago.com"]

    def start_requests(self):
        headers = {
                'Referer': "http://ar.trivago.com/?iPathId=38715&iGeoDistanceItem=47160&aDateRange[arr]=2016-01-01&aDateRange[dep]=2016-01-02&iRoomType=7&tgs=4716002&aHotelTestClassifier=&aPriceRange[from]=0&aPriceRange[to]=0&iIncludeAll=0&iGeoDistanceLimit=20000&aPartner=&iViewType=0&bIsSeoPage=false&bIsSitemap=false&",
                'X-Requested-With':'XMLHttpRequest'
            }
        # NOTE: fake_request is built here but never yielded, so Scrapy never sends it.
        fake_request = Request("http://ar.trivago.com/search/region?iPathId=38715&iGeoDistanceItem=47160&aDateRange[arr]=2015-11-13&aDateRange[dep]=2015-11-14&iRoomType=7&tgs=4716002&aHotelTestClassifier=&aPriceRange[from]=0&aPriceRange[to]=0&iIncludeAll=0&iGeoDistanceLimit=20000&aPartner=&iViewType=0&bIsSeoPage=false&bIsSitemap=false&&_=1446825699501", headers=headers)
        yield Request("http://ar.trivago.com/search/region?iPathId=38715&iGeoDistanceItem=47160&aDateRange[arr]=2015-11-13&aDateRange[dep]=2015-11-14&iRoomType=7&tgs=4716002&aHotelTestClassifier=&aPriceRange[from]=0&aPriceRange[to]=0&iIncludeAll=0&iGeoDistanceLimit=20000&aPartner=&iViewType=0&bIsSeoPage=false&bIsSitemap=false&&_=1446825699501",
                         callback=self.parse, headers=headers)

    def parse(self, response):
        print("RESPONSE:", response.body)

Answer 2

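Hi all, I found a solution based on Andres's code.

@Andrés Pérez-Albela H. I modified the code so that it gives me the actual response from the website. Because the session was not properly created before the concurrent requests were executed, the response was partial most of the time. The post "Crawling with an authenticated session in Scrapy" helped me figure this out. Thanks to @Acorn and @Andrés Pérez-Albela H.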

# -*- coding: utf-8 -*-
import scrapy
import time
from scrapy.http.request import Request
headers = {
    'Referer': "http://ar.trivago.com/?iPathId=38715&iGeoDistanceItem=47160&aDateRange[arr]=2016-01-01&aDateRange[dep]=2016-01-02&iRoomType=7&tgs=4716002&aHotelTestClassifier=&aPriceRange[from]=0&aPriceRange[to]=0&iIncludeAll=0&iGeoDistanceLimit=20000&aPartner=&iViewType=0&bIsSeoPage=false&bIsSitemap=false&",
    'X-Requested-With':'XMLHttpRequest'
    }
class test(scrapy.Spider):
    name = "test"
    allowed_domains = ["ar.trivago.com"]
    def start_requests(self):
        yield Request("http://ar.trivago.com/search/region?iPathId=38715&iGeoDistanceItem=47160&aDateRange[arr]=2015-11-13&aDateRange[dep]=2015-11-14&iRoomType=7&tgs=4716002&aHotelTestClassifier=&aPriceRange[from]=0&aPriceRange[to]=0&iIncludeAll=0&iGeoDistanceLimit=20000&aPartner=&iViewType=0&bIsSeoPage=false&bIsSitemap=false&&_=1446825699501",
                      callback=self.parse, headers=headers)
    def parse(self, response):
        yield Request("http://ar.trivago.com/search/region?iPathId=38715&iGeoDistanceItem=47160&aDateRange[arr]=2015-11-13&aDateRange[dep]=2015-11-14&iRoomType=7&tgs=4716002&aHotelTestClassifier=&aPriceRange[from]=0&aPriceRange[to]=0&iIncludeAll=0&iGeoDistanceLimit=20000&aPartner=&iViewType=0&bIsSeoPage=false&bIsSitemap=false&&_=1446825699501",
                         # dont_filter=True lets Scrapy send the same URL a second time
                         callback=self.parse_final, headers=headers, dont_filter=True)
    def parse_final(self, response):
        print("RESPONSE:", response.body)

This works for me. Thanks, everyone, for your help.


python

xmlhttprequest

request

scrapy

web-scraping