POST request in search query with Scrapy

Question

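I am trying to use a Scrapy spider to crawl a website, using a FormRequest to send a keyword to the search query on a city-specific page. From what I have read it seems straightforward, but I am having trouble. I am fairly new to Python, so apologies if there is something obvious I am overlooking.

Here are the 3 main sites I have been using for help: Mouse vs Python [1]; Stack Overflow [2]; Scrapy.org [3]

The specific URL I am crawling: www.lkqpickyourpart.com\locations/LKQ_Self_Service_-_Gainesville-224/recents

From the source of that page I found: <input name="dnn$ctl01$txtSearch" type="text" maxlength="255" size="20" id="dnn_ctl01_txtSearch" class="NormalTextBox" autocomplete="off" placeholder="Search..." /> I think the name of the search field is "dnn_ct101_txtSearch", which I would use in the example I found referenced as [2], and I want to enter "toyota" as the keyword for my vehicle search.

Here is my spider code so far. I know I am importing more than I need at the top: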

import scrapy
from scrapy.http import FormRequest
from scrapy.item import Item, Field
from scrapy.http import FormRequest
from scrapy.spider import BaseSpider

class LkqSpider(scrapy.Spider):
    name = "lkq"
    allowed_domains = ["lkqpickyourpart.com\locations/LKQ_Self_Service_-_Gainesville-224/recents"]
    start_urls = ['http://www.lkqpickyourpart.com\locations/LKQ_Self_Service_-_Gainesville-224/recents/']

    def start_requests(self):
        return [ FormRequest("www.lkqpickyourpart.com\locations/LKQ_Self_Service_-_Gainesville-224/recents",
                     formdata={'dnn$ctl01$txtSearch':'toyota'},
                     callback=self.parse) ]

    def parsel(self):
        print self.status
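Why is it not searching or printing any kind of result? Is the example I am copying only meant for logging in to websites, not for entering text into a search bar?

Thanks, Dan, a newbie Python writer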

Answer 1

Here you go :)

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import scrapy
from scrapy.shell import inspect_response
from scrapy.utils.response import open_in_browser


class Cars(scrapy.Item):
    Make = scrapy.Field()
    Model = scrapy.Field()
    Year = scrapy.Field()
    Entered_Yard = scrapy.Field()
    Section = scrapy.Field()
    Color = scrapy.Field()


class LkqSpider(scrapy.Spider):
    name = "lkq"
    allowed_domains = ["lkqpickyourpart.com"]
    start_urls = (
        'http://www.lkqpickyourpart.com/DesktopModules/pyp_vehicleInventory/getVehicleInventory.aspx?store=224&page=0&filter=toyota&sp=&cl=&carbuyYardCode=1224&pageSize=1000&language=en-US',
    )

    def parse(self, response):
        section_color = response.xpath(
            '//div[@class="pypvi_notes"]/p/text()').extract()
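        # Note: ["pypvi_make"] is a bare string predicate, which XPath treats
        # as always true, so this selects the text of every <td>. The loop
        # below relies on that, reading four cells per vehicle.
        info = response.xpath('//td["pypvi_make"]/text()').extract()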
        for element in range(0, len(info), 4):
            item = Cars()
            item["Make"] = info[element]
            item["Model"] = info[element + 1]
            item["Year"] = info[element + 2]
            item["Entered_Yard"] = info[element + 3]
            item["Section"] = section_color.pop(
                0).replace("Section:", "").strip()
            item["Color"] = section_color.pop(0).replace("Color:", "").strip()
            yield item

        # open_in_browser(response)
        # inspect_response(response, self)
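
The parse method above walks a flat list of cell texts in strides of four and takes two note strings per vehicle. A minimal sketch of that grouping step without Scrapy, so it can be tried on its own. The function name group_rows and the sample values are hypothetical, not taken from the site:

```python
def group_rows(cells, notes):
    """Pair every four table cells with the next two note strings."""
    rows = []
    # Stop before a trailing partial group so unpacking never fails.
    for start in range(0, len(cells) - 3, 4):
        make, model, year, entered = cells[start:start + 4]
        note_index = (start // 4) * 2  # two notes (section, color) per vehicle
        rows.append({
            "Make": make,
            "Model": model,
            "Year": year,
            "Entered_Yard": entered,
            "Section": notes[note_index].replace("Section:", "").strip(),
            "Color": notes[note_index + 1].replace("Color:", "").strip(),
        })
    return rows


# Hypothetical sample data shaped like the extracted lists in parse().
sample_cells = ["TOYOTA", "CAMRY", "2001", "1/2/2016",
                "TOYOTA", "COROLLA", "1999", "1/5/2016"]
sample_notes = ["Section: A1", "Color: Red", "Section: B2", "Color: Blue"]
sample_rows = group_rows(sample_cells, sample_notes)
```

Indexing the notes instead of popping them leaves the input lists unchanged, which makes the step easier to test.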

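The page you are trying to scrape is generated by an AJAX call.

By default Scrapy does not load any dynamically loaded JavaScript content, including AJAX. Almost all sites that load data dynamically as you scroll down the page do so using AJAX. Trapping AJAX calls is pretty simple with Chrome Dev Tools or Firebug for Firefox. All you have to do is watch for the XHR requests in Chrome Dev Tools or Firebug. An XHR is an AJAX request.

Here is a screenshot of how it looks:

Once you have found the link, you can change its query parameters.

This is the link that the XHR request in Chrome Dev Tools gave me: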

http://www.lkqpickyourpart.com/DesktopModules/pyp_vehicleInventory/getVehicleInventory.aspx?store=224&page=0&filter=toyota&sp=&cl=&carbuyYardCode=1224&pageSize=1000&language=en-US

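I have changed the page size above to 1000 so that each page returns 1000 results. The default is 15. There is also a page number in there; ideally you would increment it until you have captured all the data.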
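
One way to increment the page number could look like the sketch below, assuming Python 3. The helper names inventory_url and iter_all_rows are hypothetical, and treating an empty page as the end of the data is an assumption about how the endpoint behaves, not something confirmed here. The fetch step is passed in as a function so the loop can be tried without network access:

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = ("http://www.lkqpickyourpart.com/DesktopModules/"
            "pyp_vehicleInventory/getVehicleInventory.aspx")


def inventory_url(page, keyword="toyota", page_size=1000):
    """Rebuild the XHR link with a chosen page number, keyword and page size."""
    params = [
        ("store", "224"),
        ("page", str(page)),
        ("filter", keyword),
        ("sp", ""),
        ("cl", ""),
        ("carbuyYardCode", "1224"),
        ("pageSize", str(page_size)),
        ("language", "en-US"),
    ]
    return BASE_URL + "?" + urlencode(params)


def iter_all_rows(fetch_rows, max_pages=50):
    """Request page 0, 1, 2, ... until a page comes back empty (assumed end)."""
    for page in range(max_pages):
        rows = fetch_rows(inventory_url(page))
        if not rows:
            break
        for row in rows:
            yield row


# Hypothetical stand-in for a real HTTP fetch: two pages of data, then nothing.
FAKE_PAGES = {0: ["row-a", "row-b"], 1: ["row-c"]}


def fake_fetch(url):
    page = int(parse_qs(urlsplit(url).query)["page"][0])
    return FAKE_PAGES.get(page, [])


all_rows = list(iter_all_rows(fake_fetch))
```

In a spider, the same idea would mean yielding a new request for the next page from parse whenever the current page produced items. The max_pages cap is only a safety limit for the sketch.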

Answer 2

The web page needs a JavaScript rendering framework to load its content in Scrapy code.

Use Splash and refer to its documentation.
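
For orientation, a settings.py fragment along the lines of what the scrapy-splash README describes. This is a sketch, assuming the scrapy-splash package is installed and a Splash instance is reachable at the address shown. The localhost URL is an assumption, so adjust it to wherever Splash actually runs:

```python
# settings.py fragment for scrapy-splash (sketch; check the project's README).

# Assumed address of a locally running Splash instance.
SPLASH_URL = "http://localhost:8050"

DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Makes duplicate filtering aware of Splash arguments.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
```

With these settings in place, a spider would typically issue scrapy_splash.SplashRequest objects instead of plain requests so that pages are rendered before parse is called.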
