Scrapy - scraping overview page and detail page?
I am trying to scrape the following site with scrapy -
It works fine when I only scrape information from the overview page
(name, price, link);
it returns 1535 rows.
import scrapy

class WhiskeySpider(scrapy.Spider):
    name = "whisky"
    allowed_domains = ["whiskyshop.com"]
    start_urls = ["https://www.whiskyshop.com/scotch-whisky"]

    def parse(self, response):
        for products in response.css("div.product-item-info"):
            tmpPrice = products.css("span.price::text").get()
            if tmpPrice == None:
                tmpPrice = "Sold Out"
            else:
                tmpPrice = tmpPrice.replace("\u00a3",""),
            yield {
                "name": products.css("a.product-item-link::text").get(),
                "price": tmpPrice,
                "link": products.css("a.product-item-link").attrib["href"],
            }
        nextPage = response.css("a.action.next").attrib["href"]
        if nextPage != None:
            nextPage = response.urljoin(nextPage)
            yield response.follow(nextPage, callback=self.parse)
Now I also want to scrape some additional details for each item
(litre, percent, area), and I want the 3 main fields and the 3 detail fields in one row.
I tried the following code, but it doesn't work well:
import scrapy

class WhiskeySpider(scrapy.Spider):
    name = "whiskyDetail"
    allowed_domains = ["whiskyshop.com"]
    start_urls = ["https://www.whiskyshop.com/scotch-whisky"]

    def parse(self, response):
        for products in response.css("div.product-item-info"):
            tmpPrice = products.css("span.price::text").get()
            tmpLink = products.css("a.product-item-link").attrib["href"]
            tmpLink = response.urljoin(tmpLink)
            if tmpPrice == None:
                tmpPrice = "Sold Out"
            else:
                tmpPrice = tmpPrice.replace("\u00a3",""),
            yield {
                "name": products.css("a.product-item-link::text").get(),
                "price": tmpPrice,
                "link": tmpLink,
            }
            yield scrapy.Request(url=tmpLink, callback=self.parseDetails)
        nextPage = response.css("a.action.next").attrib["href"]
        if nextPage != None:
            nextPage = response.urljoin(nextPage)
            yield response.follow(nextPage, callback=self.parse)

    def parseDetails(self, response):
        tmpDetails = response.css("p.product-info-size-abv span::text").getall()
        yield {
            "litre": tmpDetails[0],
            "percent": tmpDetails[1],
            "area": tmpDetails[2]
        }
The code seems to run in an infinite loop.
In the log I can see it sometimes retries with a 429 Unknown Status:
2021-11-05 22:24:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.whiskyshop.com/benrinnes-10-year-old-that-boutique-y-whisky-company>
{'litre': '50cl', 'percent': '56.8% abv', 'area': 'Speyside'}
2021-11-05 22:24:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.whiskyshop.com/bruichladdich-28-year-old-batch-19-that-boutique-y-whisky-company> (referer: https://www.whiskyshop.com/scotch-whisky)
2021-11-05 22:24:17 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.whiskyshop.com/bruichladdich-28-year-old-batch-19-that-boutique-y-whisky-company>
{'litre': '50cl', 'percent': '48.5%% abv', 'area': 'Islay'}
2021-11-05 22:24:17 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.whiskyshop.com/westport-21-year-old-batch-1-that-boutique-y-whisky-company> (failed 1 times): 429 Unknown Status
2021-11-05 22:24:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.whiskyshop.com/strathmill-22-year-old-batch-7-that-boutique-y-whisky-company> (referer: https://www.whiskyshop.com/scotch-whisky)
2021-11-05 22:24:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.whiskyshop.com/strathmill-22-year-old-batch-7-that-boutique-y-whisky-company>
{'litre': '50cl', 'percent': '49.6% abv', 'area': 'Speyside'}
2021-11-05 22:24:20 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.whiskyshop.com/benromach-40-year-old> (failed 1 times): 429 Unknown Status
2021-11-05 22:24:22 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.whiskyshop.com/monkey-shoulder-fever-tree-gift-pack> (failed 1 times): 429 Unknown Status
2021-11-05 22:24:25 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.whiskyshop.com/catalog/product/view/id/21965/s/nc-nean-organic-single-malt/category/246/> (failed 1 times): 429 Unknown Status
In the JSON output the two kinds of information are not in one row (main info and details):
{"name": "Port Charlotte Islay Barley 2013 ", "price": ["65.00"], "link": "https://www.whiskyshop.com/port-charlotte-islay-barley-2013"},
{"name": "Bruichladdich Bere Barley 2011 ", "price": ["70.00"], "link": "https://www.whiskyshop.com/bruichladdich-bere-barley-2011"},
{"name": "Glen Grant 1950 68 Year Old ", "price": ["4,999.99"], "link": "https://www.whiskyshop.com/glen-grant-1950-68-year-old"},
{"name": "Linkwood 1981 Private Collection ", "price": ["1,250.00"], "link": "https://www.whiskyshop.com/linkwood-1981-private-collection"},
{"name": "Linkwood 1980 40 Year Old Private Collection ", "price": ["999.99"], "link": "https://www.whiskyshop.com/linkwood-1980-40-year-old-private-collection"},
{"name": "Dimensions Linkwood 2009 12 Year Old", "price": ["89.99"], "link": "https://www.whiskyshop.com/dimensions-linkwood-2009-12-year-old"},
{"name": "Dimensions Highland Park 2007 13 Year Old", "price": ["114.00"], "link": "https://www.whiskyshop.com/dimensions-highland-park-2007-13-year-old"},
{"litre": "70cl", "percent": "54.9% abv", "area": "Highland"},
{"litre": "70cl", "percent": "54.7% abv", "area": "Islay"},
{"litre": "70cl", "percent": "46% abv", "area": "Highland"},
{"litre": "70cl", "percent": "52.1% abv", "area": "Islay"},
{"litre": "70cl", "percent": "43% abv", "area": "Speyside"},
{"litre": "70cl", "percent": "43% abv", "area": "Highland"},
What am I doing wrong, and how can I get the main info and the details in one row?
(And without the retry errors?)
You have to set a download delay, otherwise you get blocked with a 429 status code / retries / lost connections and so on.
My settings.py file:
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36'
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 4
In the case of an overview page plus detail pages, another (and the easiest) way to get the data is to use CrawlSpider.
I did the pagination in start_urls; you can increase or decrease the page-number range as needed. Each page contains 100 items.
Code:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ShopSpider(CrawlSpider):
    name = 'shop'
    start_urls = ['https://www.whiskyshop.com/scotch-whisky?p=' + str(x) for x in range(1, 5)]

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//a[@class="product-item-link"]'),
             callback='parse', follow=False),
    )

    def parse(self, response):
        yield {
            'Name': response.xpath('//h1[@class="page-title"]/text()').get().strip(),
            'Price': response.xpath('(//span[@class="price"])[1]/text()').get(),
            'Litre': response.xpath('(//*[@class="product-info-size-abv"]/span)[1]/text()').get(),
            'Percent': response.xpath('(//*[@class="product-info-size-abv"]/span)[2]/text()').get(),
            'Area': response.xpath('(//*[@class="product-info-size-abv"]/span)[3]/text()').get(),
            'LINK': response.url,
        }
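To keep the 429 retries down without a fixed 4-second delay, Scrapy's AutoThrottle extension can be enabled instead; it adapts the delay to the server's response times. A possible settings.py fragment (the values are illustrative and should be tuned for the target site):

```python
# settings.py - illustrative AutoThrottle setup, not taken from the question
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0          # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 10.0           # ceiling when the server is slow
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0   # average parallel requests per server
RETRY_HTTP_CODES = [429, 500, 502, 503, 504]  # make sure 429 is retried
```

With this, a fixed DOWNLOAD_DELAY is usually unnecessary, and bursts that trigger rate limiting are smoothed out automatically.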