Python LinkExtractor to go to next pages doesn't work

Below is the code I have; I'm trying to scrape a site that has more than one page of results... I can't get the Rule class to work. What am I doing wrong?

#import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from tutorial.items import SkodaItem

class SkodaSpider(CrawlSpider):
    name = "skodas"
    allowed_domains = ["marktplaats.nl"]
    start_urls = [
        "http://www.marktplaats.nl/z/auto-s/skoda/octavia-trekhaak-stationwagon.html?categoryId=151&priceFrom=1.000%2C00&priceTo=15.000%2C00&yearFrom=2010&mileageTo=150.000&attributes=S%2C1185&attributes=S%2C484&attributes=M%2C11564&startDateFrom=always"
    ]

    rules = [
        Rule(LinkExtractor(restrict_xpaths=('//a[@class="button secondary medium pagination-next"]/a',)), follow=True),
    ]

#    def parse_item(self, response):
    def parse(self, response):
        #self.logger.info('Hi, this is an item page! %s', response.url)
        x = 0
        items = []
        for sel in response.xpath('//*[@id="search-results"]/section[2]/article'):
            x = x + 1
            item = SkodaItem()
            item["title"] = sel.xpath('//*[@id="search-results"]/section[2]/article['+str(x)+']/div/div[1]/div[1]/h2/a/span').re('.+>(.+)</span>')
            #print sel.xpath('//*[@id="search-results"]/section[2]/article['+str(x)+']/div/div[1]/div[1]/h2/a/span').extract()
            item["leeftijd"] = sel.xpath('//*[@id="search-results"]/section[2]/article['+str(x)+']/div/div[1]/div[2]/span[1]').re('.+">(.+)</span>')
            item["prijs"] = sel.xpath('//*[@id="search-results"]/section[2]/article['+str(x)+']/div/div[2]/div[1]/div/div').re('.+\n +(.+)\n.+')
            item["km"] = sel.xpath('//*[@id="search-results"]/section[2]/article['+str(x)+']/div/div[1]/div[2]/span[3]').re('.+">(.+)</span>')

            #handle output (print or safe to database)
            items.append(item)
            print item ["title"],item["leeftijd"],item["prijs"],item["km"]

A few things to change:

When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.

  • As mentioned in the comments, your XPath needs to be fixed by removing the extra /a at the end (an <a> inside an <a> will not match any element)
  • CrawlSpider rules need a callback method if you want to extract items from the pages they follow
  • To also parse the elements from the start URL, you need to define a parse_start_url method

Here's a minimal CrawlSpider that follows the 3 pages from your sample input and prints how many "articles" each page contains:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class SkodaSpider(CrawlSpider):
    name = "skodas"
    allowed_domains = ["marktplaats.nl"]
    start_urls = [
        "http://www.marktplaats.nl/z/auto-s/skoda/octavia-trekhaak-stationwagon.html?categoryId=151&priceFrom=1.000%2C00&priceTo=15.000%2C00&yearFrom=2010&mileageTo=150.000&attributes=S%2C1185&attributes=S%2C484&attributes=M%2C11564&startDateFrom=always"
    ]

    rules = [
        Rule(LinkExtractor(restrict_xpaths=('//a[@class="button secondary medium pagination-next"]',)),
             follow=True,
             callback='parse_page'),
    ]

    def parse_page(self, response):
        articles = response.css('#search-results > section + section > article')
        self.logger.info('%d articles' % len(articles))

    # define this, otherwise "parse_page" will not be called for the URLs in start_urls
    parse_start_url = parse_page

Output:

$ scrapy runspider 001.py 
2016-02-09 11:07:16 [scrapy] INFO: Scrapy 1.0.4 started (bot: scrapybot)
2016-02-09 11:07:16 [scrapy] INFO: Optional features available: ssl, http11
2016-02-09 11:07:16 [scrapy] INFO: Overridden settings: {}
2016-02-09 11:07:16 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-02-09 11:07:16 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-02-09 11:07:16 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-02-09 11:07:16 [scrapy] INFO: Enabled item pipelines: 
2016-02-09 11:07:16 [scrapy] INFO: Spider opened
2016-02-09 11:07:16 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-02-09 11:07:16 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-02-09 11:07:16 [scrapy] DEBUG: Crawled (200) <GET http://www.marktplaats.nl/z/auto-s/skoda/octavia-trekhaak-stationwagon.html?categoryId=151&priceFrom=1.000%2C00&priceTo=15.000%2C00&yearFrom=2010&mileageTo=150.000&attributes=S%2C1185&attributes=S%2C484&attributes=M%2C11564&startDateFrom=always> (referer: None)
2016-02-09 11:07:16 [skodas] INFO: 32 articles
2016-02-09 11:07:17 [scrapy] DEBUG: Crawled (200) <GET http://www.marktplaats.nl/z/auto-s/skoda.html?attributes=S%2C1185+S%2C484+M%2C11564&categoryId=151&currentPage=2&mileageTo=150.000&priceFrom=1.000%2C00&priceTo=15.000%2C00&yearFrom=2010> (referer: http://www.marktplaats.nl/z/auto-s/skoda/octavia-trekhaak-stationwagon.html?categoryId=151&priceFrom=1.000%2C00&priceTo=15.000%2C00&yearFrom=2010&mileageTo=150.000&attributes=S%2C1185&attributes=S%2C484&attributes=M%2C11564&startDateFrom=always)
2016-02-09 11:07:17 [skodas] INFO: 30 articles
2016-02-09 11:07:17 [scrapy] DEBUG: Crawled (200) <GET http://www.marktplaats.nl/z/auto-s/skoda.html?attributes=S%2C1185+S%2C484+M%2C11564&categoryId=151&currentPage=3&mileageTo=150.000&priceFrom=1.000%2C00&priceTo=15.000%2C00&yearFrom=2010> (referer: http://www.marktplaats.nl/z/auto-s/skoda.html?attributes=S%2C1185+S%2C484+M%2C11564&categoryId=151&currentPage=2&mileageTo=150.000&priceFrom=1.000%2C00&priceTo=15.000%2C00&yearFrom=2010)
2016-02-09 11:07:17 [skodas] INFO: 7 articles
2016-02-09 11:07:17 [scrapy] INFO: Closing spider (finished)
2016-02-09 11:07:17 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1919,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'downloader/response_bytes': 96682,
 'downloader/response_count': 3,
 'downloader/response_status_count/200': 3,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 2, 9, 10, 7, 17, 638179),
 'log_count/DEBUG': 4,
 'log_count/INFO': 10,
 'request_depth_max': 2,
 'response_received_count': 3,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2016, 2, 9, 10, 7, 16, 452272)}
2016-02-09 11:07:17 [scrapy] INFO: Spider closed (finished)
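
If you also want to extract the items instead of just counting articles, the same parse_page callback can yield them. Here is a rough sketch (untested against the live page) that reuses the SkodaItem from the question and translates your absolute XPaths into selectors relative to each article, so you don't need the x counter; the exact relative paths are guesses based on your XPaths and may need adjusting:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from tutorial.items import SkodaItem  # the item class from the question

class SkodaSpider(CrawlSpider):
    name = "skodas"
    allowed_domains = ["marktplaats.nl"]
    start_urls = [
        "http://www.marktplaats.nl/z/auto-s/skoda/octavia-trekhaak-stationwagon.html?categoryId=151&priceFrom=1.000%2C00&priceTo=15.000%2C00&yearFrom=2010&mileageTo=150.000&attributes=S%2C1185&attributes=S%2C484&attributes=M%2C11564&startDateFrom=always"
    ]

    rules = [
        Rule(LinkExtractor(restrict_xpaths=('//a[@class="button secondary medium pagination-next"]',)),
             follow=True,
             callback='parse_page'),
    ]

    def parse_page(self, response):
        # loop over the result articles; the selectors below are relative to each article,
        # so there is no need for absolute XPaths with a manual counter
        for article in response.css('#search-results > section + section > article'):
            item = SkodaItem()
            # relative versions of the XPaths from the question (may need tweaking)
            item["title"] = article.xpath('./div/div[1]/div[1]/h2/a/span/text()').extract()
            item["leeftijd"] = article.xpath('./div/div[1]/div[2]/span[1]/text()').extract()
            item["prijs"] = article.xpath('./div/div[2]/div[1]/div/div/text()').extract()
            item["km"] = article.xpath('./div/div[1]/div[2]/span[3]/text()').extract()
            yield item

    # also run the same callback on the start URL
    parse_start_url = parse_page

With the items being yielded instead of printed, you can let Scrapy handle the output with a feed export, for example:

scrapy runspider 001.py -o skodas.json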