Scrapy CrawlSpider not following links

Question

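I am trying to scrape some attributes from all (#123) detail pages listed on this category page - http://stinkybklyn.com/shop/cheese/ - but Scrapy does not follow the link pattern I set. I have checked the Scrapy docs and some tutorials, but no luck!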

The code is below:

import scrapy

from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule

class Stinkybklyn(CrawlSpider):
    name = "Stinkybklyn"
    allowed_domains = ["stinkybklyn.com"]
    start_urls = [
        "http://stinkybklyn.com/shop/cheese/chandoka",
    ]
    Rule(LinkExtractor(allow=r'\/shop\/cheese\/.*'),
         callback='parse_items', follow=True)


    def parse_items(self, response):
        print "response", response
        hxs= HtmlXPathSelector(response)
        title=hxs.select("//*[@id='content']/div/h4").extract()
        title="".join(title)
        title=title.strip().replace("\n","").lstrip()
        print "title is:",title

Can someone tell me what I am doing wrong here?

Answer 1

You seem to have some syntax errors. Try this:

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.selector import HtmlXPathSelector


class Stinkybklyn(CrawlSpider):
    name = "Stinkybklyn"
    allowed_domains = ["stinkybklyn.com"]
    start_urls = [
        "http://stinkybklyn.com/shop/cheese/",
    ]

    rules = (
        Rule(LinkExtractor(allow=(r'/shop/cheese/')), callback='parse_items'),
    )

    def parse_items(self, response):
        # The method body must be indented one level deeper than the def line.
        print "response", response

Answer 2

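The key problem with your code is that you have not set rules for the CrawlSpider. The Rule(...) expression in your class body is never assigned to a rules attribute, so the spider has no rules to apply.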
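
To illustrate why a bare Rule(...) in the class body has no effect, here is a minimal plain-Python sketch with no Scrapy dependency. BaseSpider, make_rule, and the two subclasses are hypothetical stand-ins, not Scrapy classes; the only point is that an expression statement in a class body is evaluated and then discarded unless it is assigned to a name.

```python
# Hypothetical stand-ins for illustration only; these are not Scrapy classes.
class BaseSpider:
    rules = ()  # default: no rules


def make_rule(pattern):
    # Stand-in for constructing a Rule; it simply returns a tuple.
    return ("rule", pattern)


class BareExpression(BaseSpider):
    # Evaluated while the class body runs, then thrown away.
    make_rule(r"/shop/cheese/.*")


class AssignedRules(BaseSpider):
    # Assigned to the attribute, so it is visible on the class.
    rules = (make_rule(r"/shop/cheese/.*"),)


print(BareExpression.rules)  # still the inherited empty default
print(AssignedRules.rules)   # contains the one rule
```

The first subclass mirrors the spider in the question; the second mirrors the fix.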

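Other improvements I would suggest:

There is no need to instantiate HtmlXPathSelector; you can use response directly
select() is now deprecated; use xpath() instead
Get the text() of the title element so that you retrieve, for example, Chandoka instead of <h4>Chandoka</h4>
I think you meant to start from the cheese shop catalog page instead: http://stinkybklyn.com/shop/cheese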
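
The difference between selecting the h4 element and selecting its text can be sketched with the standard library alone. This is only an analogy using xml.etree.ElementTree, not Scrapy's selector API, and the sample markup is an assumption about the page structure rather than a copy of the real page.

```python
import xml.etree.ElementTree as ET

# Assumed sample markup; the real page structure may differ.
html = '<div id="content"><div><h4>Chandoka</h4></div></div>'
root = ET.fromstring(html)

# root is the div with id="content"; look for div/h4 beneath it.
h4 = root.find("./div/h4")

whole_element = ET.tostring(h4, encoding="unicode")  # markup included
text_only = h4.text                                  # text node only

print(whole_element)
print(text_only)
```

Selecting the element corresponds to an XPath ending in /h4, while selecting only the text corresponds to one ending in /h4/text().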

The complete code with the improvements applied:

from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule


class Stinkybklyn(CrawlSpider):
    name = "Stinkybklyn"
    allowed_domains = ["stinkybklyn.com"]

    start_urls = [
        "http://stinkybklyn.com/shop/cheese",
    ]

    rules = [
        Rule(LinkExtractor(allow=r'\/shop\/cheese\/.*'), callback='parse_items', follow=True)
    ]

    def parse_items(self, response):
        title = response.xpath("//*[@id='content']/div/h4/text()").extract()
        title = "".join(title)
        title = title.strip().replace("\n", "").lstrip()
        print "title is:", title
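
As a rough check of which links the allow pattern would accept, the sketch below applies the same regular expression with Python's re module. It assumes the pattern is applied as a search against the absolute URL, which is my understanding of how LinkExtractor treats allow. Apart from the chandoka URL from the question, the candidate URLs are hypothetical.

```python
import re

# The allow pattern from the rule above; the escaped slashes behave like plain ones.
allow = re.compile(r'\/shop\/cheese\/.*')

# Hypothetical URLs for illustration; only the first one appears in the question.
candidates = [
    "http://stinkybklyn.com/shop/cheese/chandoka",
    "http://stinkybklyn.com/shop/cheese",
    "http://stinkybklyn.com/shop/charcuterie/",
    "http://stinkybklyn.com/about/",
]

# Keep only the URLs in which the pattern is found somewhere.
matched = [url for url in candidates if allow.search(url)]

for url in matched:
    print(url)
```

Under that assumption, the catalog URL without a trailing slash is not matched, because the pattern requires a slash after cheese; that only affects which extracted links are followed, not the start URL itself.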
