Scrapy - scraping overview-page and detail-page?

I am trying to scrape the following site with Scrapy: https://www.whiskyshop.com/scotch-whisky

When I only scrape the information from the overview page (name, price, link), it works fine and returns 1535 rows.

import scrapy

class WhiskeySpider(scrapy.Spider):
  name = "whisky"
  allowed_domains = ["whiskyshop.com"]
  start_urls = ["https://www.whiskyshop.com/scotch-whisky"]

  def parse(self, response):
    for products in response.css("div.product-item-info"):
      tmpPrice = products.css("span.price::text").get()
      if tmpPrice == None:
        tmpPrice = "Sold Out"
      else:
        tmpPrice = tmpPrice.replace("\u00a3",""),
      yield {
        "name": products.css("a.product-item-link::text").get(),
        "price": tmpPrice,
        "link": products.css("a.product-item-link").attrib["href"],
      }
    
    nextPage = response.css("a.action.next").attrib["href"]
    if nextPage != None:
      nextPage = response.urljoin(nextPage)
      yield response.follow(nextPage, callback=self.parse)

Now I also want to scrape some additional details for each item (like litre, percent, area), and I would like to have the 3 main values and the 3 detail values together in one row.

I tried it with the following code, but it does not work well:

import scrapy

class WhiskeySpider(scrapy.Spider):
  name = "whiskyDetail"
  allowed_domains = ["whiskyshop.com"]
  start_urls = ["https://www.whiskyshop.com/scotch-whisky"]

  def parse(self, response):
    for products in response.css("div.product-item-info"):
      tmpPrice = products.css("span.price::text").get()      
      tmpLink = products.css("a.product-item-link").attrib["href"]
      tmpLink = response.urljoin(tmpLink)
      
      if tmpPrice == None:
        tmpPrice = "Sold Out"
      else:
        tmpPrice = tmpPrice.replace("\u00a3",""),
      yield {
        "name": products.css("a.product-item-link::text").get(),
        "price": tmpPrice,
        "link": tmpLink,
      }

      yield scrapy.Request(url=tmpLink, callback=self.parseDetails)                    
    
    nextPage = response.css("a.action.next").attrib["href"]
    if nextPage != None:
      nextPage = response.urljoin(nextPage)
      yield response.follow(nextPage, callback=self.parse)
  
  def parseDetails(self, response):
    tmpDetails = response.css("p.product-info-size-abv span::text").getall()
    yield {
      "litre": tmpDetails[0],
      "percent": tmpDetails[1],
      "area": tmpDetails[2]
    }

The code seems to run in an endless loop, and in the log I can see that it sometimes retries requests with "429 Unknown Status":

2021-11-05 22:24:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.whiskyshop.com/benrinnes-10-year-old-that-boutique-y-whisky-company>
{'litre': '50cl', 'percent': '56.8% abv', 'area': 'Speyside'}
2021-11-05 22:24:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.whiskyshop.com/bruichladdich-28-year-old-batch-19-that-boutique-y-whisky-company> (referer: https://www.whiskyshop.com/scotch-whisky)
2021-11-05 22:24:17 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.whiskyshop.com/bruichladdich-28-year-old-batch-19-that-boutique-y-whisky-company>
{'litre': '50cl', 'percent': '48.5%% abv', 'area': 'Islay'}
2021-11-05 22:24:17 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.whiskyshop.com/westport-21-year-old-batch-1-that-boutique-y-whisky-company> (failed 1 times): 429 Unknown Status
2021-11-05 22:24:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.whiskyshop.com/strathmill-22-year-old-batch-7-that-boutique-y-whisky-company> (referer: https://www.whiskyshop.com/scotch-whisky)
2021-11-05 22:24:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.whiskyshop.com/strathmill-22-year-old-batch-7-that-boutique-y-whisky-company>
{'litre': '50cl', 'percent': '49.6% abv', 'area': 'Speyside'}
2021-11-05 22:24:20 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.whiskyshop.com/benromach-40-year-old> (failed 1 times): 429 Unknown Status
2021-11-05 22:24:22 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.whiskyshop.com/monkey-shoulder-fever-tree-gift-pack> (failed 1 times): 429 Unknown Status
2021-11-05 22:24:25 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.whiskyshop.com/catalog/product/view/id/21965/s/nc-nean-organic-single-malt/category/246/> (failed 1 times): 429 Unknown Status

In the JSON output, the main information and the detail information do not end up in one row:

{"name": "Port Charlotte Islay Barley 2013 ", "price": ["65.00"], "link": "https://www.whiskyshop.com/port-charlotte-islay-barley-2013"},
{"name": "Bruichladdich Bere Barley 2011 ", "price": ["70.00"], "link": "https://www.whiskyshop.com/bruichladdich-bere-barley-2011"},
{"name": "Glen Grant 1950 68 Year Old ", "price": ["4,999.99"], "link": "https://www.whiskyshop.com/glen-grant-1950-68-year-old"},
{"name": "Linkwood 1981 Private Collection ", "price": ["1,250.00"], "link": "https://www.whiskyshop.com/linkwood-1981-private-collection"},
{"name": "Linkwood 1980 40 Year Old Private Collection ", "price": ["999.99"], "link": "https://www.whiskyshop.com/linkwood-1980-40-year-old-private-collection"},
{"name": "Dimensions Linkwood 2009 12 Year Old", "price": ["89.99"], "link": "https://www.whiskyshop.com/dimensions-linkwood-2009-12-year-old"},
{"name": "Dimensions Highland Park 2007 13 Year Old", "price": ["114.00"], "link": "https://www.whiskyshop.com/dimensions-highland-park-2007-13-year-old"},
{"litre": "70cl", "percent": "54.9% abv", "area": "Highland"},
{"litre": "70cl", "percent": "54.7% abv", "area": "Islay"},
{"litre": "70cl", "percent": "46% abv", "area": "Highland"},
{"litre": "70cl", "percent": "52.1% abv", "area": "Islay"},
{"litre": "70cl", "percent": "43% abv", "area": "Speyside"},
{"litre": "70cl", "percent": "43% abv", "area": "Highland"},

What am I doing wrong, and how can I get the main information and the detail information together in one row (and without the retry errors)?

You have to use a download delay, otherwise you get blocked with 429 status codes, retries, lost connections and so on. My settings.py file:

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36'
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 4
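
With these settings the fixed 4-second delay keeps the request rate low enough that the shop stops answering with 429. As a possible alternative to a fixed delay, Scrapy's built-in AutoThrottle extension adjusts the delay dynamically from the server's response times; a minimal settings.py sketch (the numbers are illustrative starting points, not tuned for this site):

AUTOTHROTTLE_ENABLED = True            # adapt the delay to the measured latency
AUTOTHROTTLE_START_DELAY = 2           # initial download delay in seconds
AUTOTHROTTLE_MAX_DELAY = 30            # upper bound when the server slows down
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # aim for one request at a time per site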

For an overview page with detail pages, another (and the simplest) way to grab the data is a CrawlSpider. I have handled the pagination in start_urls; you can increase or decrease the page-number range to whatever you need. Here each page contains 100 items.

Code:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ShopSpider(CrawlSpider):
    name = 'shop'
    # Pagination is handled here; adjust the range to the pages you need.
    start_urls = ['https://www.whiskyshop.com/scotch-whisky?p=' + str(x)
                  for x in range(1, 5)]

    rules = (
        # Follow every product link on the overview page to its detail page.
        # Note: the callback must not be named 'parse', because CrawlSpider
        # uses the parse method itself to implement its rule logic.
        Rule(LinkExtractor(restrict_xpaths='//a[@class="product-item-link"]'),
             callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        yield {
            'Name': response.xpath('//h1[@class="page-title"]/text()').get().strip(),
            'Price': response.xpath('(//span[@class="price"])[1]/text()').get(),
            'Litre': response.xpath('(//*[@class="product-info-size-abv"]/span)[1]/text()').get(),
            'Percent': response.xpath('(//*[@class="product-info-size-abv"]/span)[2]/text()').get(),
            'Area': response.xpath('(//*[@class="product-info-size-abv"]/span)[3]/text()').get(),
            'LINK': response.url,
        }
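
As for why the original spider produces two separate rows: each yield creates its own item, so fields yielded from parse and fields yielded from parseDetails can never land in the same row. If you prefer to keep the two-callback structure of your own spider, you can pass the partially filled item from the overview callback into the detail callback with Request.cb_kwargs (available since Scrapy 1.7) and yield one merged item there. A minimal sketch under that assumption (spider and method names are illustrative):

import scrapy


class WhiskyMergedSpider(scrapy.Spider):
    name = "whiskyMerged"  # illustrative name
    allowed_domains = ["whiskyshop.com"]
    start_urls = ["https://www.whiskyshop.com/scotch-whisky"]

    def parse(self, response):
        for product in response.css("div.product-item-info"):
            price = product.css("span.price::text").get()
            link = response.urljoin(
                product.css("a.product-item-link").attrib["href"])
            item = {
                "name": product.css("a.product-item-link::text").get(),
                "price": "Sold Out" if price is None else price.replace("\u00a3", ""),
                "link": link,
            }
            # Do not yield the item yet; hand it to the detail callback,
            # which adds the remaining fields and yields the merged row.
            yield scrapy.Request(link, callback=self.parse_details,
                                 cb_kwargs={"item": item})

        # ::attr(href) returns None instead of raising when there is no next page
        next_page = response.css("a.action.next::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

    def parse_details(self, response, item):
        details = response.css("p.product-info-size-abv span::text").getall()
        # Guard against detail pages without the size/abv block
        item["litre"] = details[0] if len(details) > 0 else None
        item["percent"] = details[1] if len(details) > 1 else None
        item["area"] = details[2] if len(details) > 2 else None
        yield item

Together with the DOWNLOAD_DELAY from the settings above, this writes one row per product containing all six fields.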