How to follow next pages in Scrapy Crawler to scrape content

Question

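I can scrape all the stories from the first page. My problem is how to move to the next page and continue scraping the stories and names. Please check my code below: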

# -*- coding: utf-8 -*-
import scrapy
from cancerstories.items import CancerstoriesItem


class MyItem(scrapy.Item):
    name = scrapy.Field()
    story = scrapy.Field()


class MySpider(scrapy.Spider):

    name = 'cancerstories'
    allowed_domains = ['thebreastcancersite.greatergood.com']
    start_urls = ['http://thebreastcancersite.greatergood.com/clickToGive/bcs/stories/']

    def parse(self, response):

        rows = response.xpath('//a[contains(@href,"story")]')

        #loop over all links to stories
        for row in rows:
            myItem = MyItem() # Create a new item
            myItem['name'] = row.xpath('./text()').extract() # assign name from link
            story_url = response.urljoin(row.xpath('./@href').extract()[0]) # extract url from link
            request = scrapy.Request(url = story_url, callback = self.parse_detail) # create request for detail page with story
            request.meta['myItem'] = myItem # pass the item with the request
            yield request

    def parse_detail(self, response):
        myItem = response.meta['myItem'] # extract the item (with the name) from the response
        #myItem['name']=response.xpath('//h1[@class="headline"]/text()').extract()
        text_raw = response.xpath('//div[@class="photoStoryBox"]/div/p/text()').extract() # extract the story (text)
        myItem['story'] = ' '.join(map(unicode.strip, text_raw)) # clean up the text and assign to item
        yield myItem # return the item

Answer 1

You can change scrapy.Spider to CrawlSpider and use Rule and LinkExtractor to follow the link to the next page.

For this approach you have to include the code below:

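...
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
...
class MySpider(CrawlSpider):
    ...
    # rules is a class attribute of the spider; a Rule without a callback only follows links
    rules = (
        # raw string; \? matches the literal "?" before "page="
        Rule(LinkExtractor(allow=r'\.\./stories;jsessionid=[0-9A-Z]+\?page=[0-9]+')),
    )
    ...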

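This way, for each page you visit, the spider creates a request for the next page (if there is one), follows it once the parse method has finished executing, and then repeats the process.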
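One detail worth checking in a pattern like this is the question mark. The sketch below uses only Python's re module and a made-up href in the shape the rule describes (the jsessionid value is an assumption, not taken from the site). It only exercises the regular expression itself and says nothing about which form of the URL LinkExtractor compares it with.

```python
import re

# Hypothetical href, shaped like the pagination links the rule targets.
href = '../stories;jsessionid=0A1B2C3D?page=2'

# Without a backslash, "+?" is a lazy quantifier, so no literal "?" is matched.
unescaped = re.compile(r'\.\./stories;jsessionid=[0-9A-Z]+?page=[0-9]+')
# With the backslash, "\?" matches the literal "?" that precedes "page=".
escaped = re.compile(r'\.\./stories;jsessionid=[0-9A-Z]+\?page=[0-9]+')

unescaped_matches = bool(unescaped.search(href))
escaped_matches = bool(escaped.search(href))

print(unescaped_matches)  # False
print(escaped_matches)    # True
```

Writing the pattern as a raw string (r'...') also avoids relying on Python passing unknown escape sequences such as \. through unchanged.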

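Edit:

The rule I wrote only follows the next-page link; it does not extract the stories. If your first approach works, there is no need to change it.

Also, regarding the rule in your comment: SgmlLinkExtractor is deprecated, so I recommend using the default link extractor, and the rule itself is not well defined.

When the attrs parameter of the extractor is not defined, it looks for links in the href attributes of the tags in the body, which in this case look like ../story/mother-of-4435 rather than /clickToGive/bcs/story/mother-of-4435. That is why it does not find any links.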
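To make the relation between the two forms concrete, here is a small sketch using the standard library (Python 3 import path; on Python 2.7 the same function lives in the urlparse module). It resolves the relative href against the listing page URL from the question, which is essentially what response.urljoin does in the question's code.

```python
from urllib.parse import urljoin

# Listing page URL from the question's start_urls.
page_url = 'http://thebreastcancersite.greatergood.com/clickToGive/bcs/stories/'
# Relative href as it appears in the page body, per the answer.
relative_href = '../story/mother-of-4435'

# ".." steps up from "stories/" to "bcs/", then "story/..." is appended.
absolute_url = urljoin(page_url, relative_href)
print(absolute_url)
# http://thebreastcancersite.greatergood.com/clickToGive/bcs/story/mother-of-4435
```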

Answer 2

If you want to keep using the scrapy.Spider class, you can follow the next page manually, for example:

    next_page = response.css('a.pageLink ::attr(href)').extract_first()
    if next_page:
        absolute_next_page_url = response.urljoin(next_page)
        yield scrapy.Request(url=absolute_next_page_url, callback=self.parse)

If you want to use the CrawlSpider class, do not forget to rename your parse method to parse_start_url.

web-crawler

scrapy

python-2.7