How to crawl a site and parse only pages that match a RegEx using Scrapy 0.24

I'm using Scrapy 0.24 with Python 2.7.9 on a 64-bit Windows machine. I'm trying to tell Scrapy to start at the URL http://www.allen-heath.com/products/ and, from there, collect data only from pages whose URL contains the string ahproducts.

Unfortunately, when I do this, no data gets scraped at all. What am I doing wrong? My code is below. If I can provide any more information to help answer this, please ask and I will edit.

Here is a pastebin of my crawler log: http://pastebin.com/C2QC23m3

Thanks.

import scrapy
import urlparse

from allenheath.items import ProductItem
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
from scrapy.contrib.spiders import Rule
from scrapy.contrib.linkextractors import LinkExtractor

class productsSpider(scrapy.Spider):
    name = "products"
    allowed_domains = ["http://www.allen-heath.com/"]
    start_urls = [
        "http://www.allen-heath.com/products/"
    ]
    rules = [Rule(LinkExtractor(allow=['ahproducts']), 'parse')]

    def parse(self, response):
        for sel in response.xpath('/html'):
            item = ProductItem()
            item['model'] = sel.css('#prodsingleouter > div > div > h2::text').extract()
            item['itemcode'] = sel.css('#prodsingleouter > div > div > h2::text').extract()
            item['shortdesc'] = sel.css('#prodsingleouter > div > div > h3::text').extract()
            item['desc'] = sel.css('#tab1 #productcontent').extract()
            item['series'] = sel.css('#pagestrip > div > div > a:nth-child(3)::text').extract()
            item['imageorig'] = sel.css('#prodsingleouter > div > div > h2::text').extract()
            item['image_urls'] = sel.css('#tab1 #productcontent .col-sm-9 img').xpath('./@src').extract()
            item['image_urls'] = [urlparse.urljoin(response.url, url) for url in item['image_urls']]
            yield item

Following some suggestions from eLRuLL, here is my updated spider file. I changed start_url to a page that contains links with "ahproducts" in their URLs; my original start page had no matching URLs on it.

products.py

import scrapy
import urlparse

from allenheath.items import ProductItem
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
from scrapy.contrib.spiders import Rule
from scrapy.contrib.linkextractors import LinkExtractor

class productsSpider(scrapy.contrib.spiders.CrawlSpider):
    name = "products"
    allowed_domains = ["http://www.allen-heath.com/"]
    start_urls = [
        "http://www.allen-heath.com/key-series/ilive-series/ilive-remote-controllers/"
    ]
    rules = (
            Rule(
                LinkExtractor(allow='.*ahproducts.*'),
                callback='parse_item'
                ),
            )

    def parse_item(self, response):
        for sel in response.xpath('/html'):
            item = ProductItem()
            item['model'] = sel.css('#prodsingleouter > div > div > h2::text').extract()
            item['itemcode'] = sel.css('#prodsingleouter > div > div > h2::text').extract()
            item['shortdesc'] = sel.css('#prodsingleouter > div > div > h3::text').extract()
            item['desc'] = sel.css('#tab1 #productcontent').extract()
            item['series'] = sel.css('#pagestrip > div > div > a:nth-child(3)::text').extract()
            item['imageorig'] = sel.css('#prodsingleouter > div > div > h2::text').extract()
            item['image_urls'] = sel.css('#tab1 #productcontent .col-sm-9 img').xpath('./@src').extract()
            item['image_urls'] = [urlparse.urljoin(response.url, url) for url in item['image_urls']]
            yield item

First, to use rules you need to subclass scrapy.contrib.spiders.CrawlSpider instead of scrapy.Spider.
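For reference, a minimal sketch of just that declaration change (using the Scrapy 0.24 contrib import path):

from scrapy.contrib.spiders import CrawlSpider

class productsSpider(CrawlSpider):  # subclass CrawlSpider; a plain scrapy.Spider never processes the rules attribute
    name = "products"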

Then rename your method to parse_item instead of parse and update your rules, for example:

rules = (
    Rule(
        LinkExtractor(allow='.*ahproducts.*'),
        callback='parse_item'
    ),
)

The parse method is always called as the callback for the responses to the start_urls requests; CrawlSpider uses it internally to apply the rules, which is why you must not override it.
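If you also want to extract an item from the start page itself, CrawlSpider exposes a parse_start_url hook for that. A small sketch reusing the parse_item callback from above (just an illustration; only needed when the start URL is itself a product page):

    def parse_start_url(self, response):
        # Called for responses to start_urls; delegate to the same item-extraction logic.
        return self.parse_item(response)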

Finally, change allowed_domains to allowed_domains = ["allen-heath.com"].

P.S. To crawl different levels of the site with rules, you need to specify which links to follow and which links to parse, like this:

rules = (
    Rule(
        LinkExtractor(
            allow=('some link to follow')
        ),
        follow=True,
    ),
    Rule(
        LinkExtractor(
            allow=('some link to parse')
        ),
        callback='parse_method',
    ),
)
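Putting the answer together, here is a sketch of what the corrected spider could look like. The selectors are copied from the question; the 'key-series' follow pattern is only a guess based on the start URL used in the updated spider above, so adjust it to the site's real category paths, and the whole thing is untested against the live site.

import urlparse

from allenheath.items import ProductItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class productsSpider(CrawlSpider):
    name = "products"
    allowed_domains = ["allen-heath.com"]  # bare domain, no scheme or trailing slash
    start_urls = [
        "http://www.allen-heath.com/products/"
    ]

    rules = (
        # Follow intermediate category/series pages so the crawl can reach the product pages.
        Rule(LinkExtractor(allow='key-series'), follow=True),
        # Parse only pages whose URL contains "ahproducts".
        Rule(LinkExtractor(allow='ahproducts'), callback='parse_item'),
    )

    def parse_item(self, response):
        item = ProductItem()
        item['model'] = response.css('#prodsingleouter > div > div > h2::text').extract()
        item['itemcode'] = item['model']   # the question used the same selector for both fields
        item['imageorig'] = item['model']  # likewise copied from the question
        item['shortdesc'] = response.css('#prodsingleouter > div > div > h3::text').extract()
        item['desc'] = response.css('#tab1 #productcontent').extract()
        item['series'] = response.css('#pagestrip > div > div > a:nth-child(3)::text').extract()
        image_srcs = response.css('#tab1 #productcontent .col-sm-9 img').xpath('./@src').extract()
        item['image_urls'] = [urlparse.urljoin(response.url, url) for url in image_srcs]
        yield item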