Trouble with downloading images using Scrapy
I'm getting the following error when trying to download images with a Scrapy spider.
  File "C:\Python27\lib\site-packages\scrapy\http\request\__init__.py", line 61, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
exceptions.ValueError: Missing scheme in request url: h
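As far as I can tell, it looks like I'm missing an "h" in a URL somewhere? But I can't for the life of me see where. Everything works fine if I'm not trying to download images, but as soon as I add the relevant code to the four files below, nothing works properly. Can anyone help me make sense of this error?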
items.py
import scrapy

class ProductItem(scrapy.Item):
    model = scrapy.Field()
    shortdesc = scrapy.Field()
    desc = scrapy.Field()
    series = scrapy.Field()
    imageorig = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
settings.py
BOT_NAME = 'allenheath'
SPIDER_MODULES = ['allenheath.spiders']
NEWSPIDER_MODULE = 'allenheath.spiders'
ITEM_PIPELINES = {'scrapy.contrib.pipeline.images.ImagesPipeline': 1}
IMAGES_STORE = 'c:/allenheath/images'
pipelines.py
class AllenheathPipeline(object):
    def process_item(self, item, spider):
        return item

import scrapy
from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.exceptions import DropItem

class MyImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item
products.py (my spider)
import scrapy
from allenheath.items import ProductItem
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

class productsSpider(scrapy.Spider):
    name = "products"
    allowed_domains = ["http://www.allen-heath.com/"]
    start_urls = [
        "http://www.allen-heath.com/ahproducts/ilive-80/",
        "http://www.allen-heath.com/ahproducts/ilive-112/"
    ]

    def parse(self, response):
        for sel in response.xpath('/html'):
            item = ProductItem()
            item['model'] = sel.css('#prodsingleouter > div > div > h2::text').extract()
            item['shortdesc'] = sel.css('#prodsingleouter > div > div > h3::text').extract()
            item['desc'] = sel.css('#tab1 #productcontent').extract()
            item['series'] = sel.css('#pagestrip > div > div > a:nth-child(3)::text').extract()
            item['imageorig'] = sel.css('#prodsingleouter > div > div > h2::text').extract()
            item['image_urls'] = sel.css('#tab1 #productcontent img').extract()[0]
            item['image_urls'] = 'http://www.allen-heath.com' + item['image_urls']
            yield item
Any help would be greatly appreciated.
The problem is here:
def get_media_requests(self, item, info):
    for image_url in item['image_urls']:
        yield scrapy.Request(image_url)
and here:
item['image_urls'] = sel.css('#tab1 #productcontent img').extract()[0]
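You are extracting the field and taking its first element, so item['image_urls'] ends up as a single string rather than a list. When the pipeline then iterates over it, it is actually iterating over the characters of that string, which begins with http. That explains the error message, which is raised as soon as the first character, h, is processed: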
Missing scheme in request url: h
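To make the string-versus-list point concrete, here is a minimal sketch that needs no Scrapy install. The helper `first_missing_scheme` and the sample URL are hypothetical, and the scheme check with `urllib.parse.urlparse` (Python 3) is only a stand-in for the validation Scrapy performs inside `Request`, not its actual implementation.

```python
from urllib.parse import urlparse


def first_missing_scheme(image_urls):
    """Return the first value the pipeline-style loop would reject, or None."""
    # Same loop shape as get_media_requests in the question.
    for image_url in image_urls:
        # Stand-in check: a usable request URL needs a scheme such as "http".
        if not urlparse(image_url).scheme:
            return image_url
    return None


as_string = "http://www.allen-heath.com/img/a.jpg"  # hypothetical URL
as_list = [as_string]

# Iterating a string yields single characters, so the first "URL" is "h".
bad_from_string = first_missing_scheme(as_string)
# Iterating a list yields whole URLs, so nothing is rejected.
bad_from_list = first_missing_scheme(as_list)

print(repr(bad_from_string), repr(bad_from_list))
```

The first call stops at the single character "h", which matches the value shown at the end of the error message.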
Remove the [0] from that line. While you're at it, fetch the image's src attribute rather than the whole element:
item['image_urls'] = sel.css('#tab1 #productcontent img').xpath('./@src').extract()
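As a rough illustration of what "the src, not the whole element" means, the sketch below collects only the `src` attribute values from some markup. It uses the standard library's `html.parser` as a stand-in for Scrapy's selectors, so the class `ImgSrcCollector` and the sample HTML are assumptions made up for this example, not part of the original spider.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.srcs.append(src)


# Hypothetical markup shaped like the #productcontent block in the question.
html = (
    '<div id="productcontent">'
    '<img src="/media/a.jpg" alt="A"><img src="/media/b.jpg">'
    "</div>"
)

collector = ImgSrcCollector()
collector.feed(html)
image_urls = collector.srcs  # a list of URL strings, not <img ...> markup
print(image_urls)
```

The result is a list with one entry per image, which is the shape the pipeline loop expects for `image_urls`.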
After that, you should also update the next line so that, if an image URL is relative, it gets converted to an absolute one:
import urlparse # put this at the top of the script
item['image_urls'] = [urlparse.urljoin(response.url, url) for url in item['image_urls']]
But if the image URLs in src are actually absolute, you don't need this last part, so just remove it.
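For reference, here is a small sketch of how `urljoin` treats relative and absolute values. It is written for Python 3, where the function lives in `urllib.parse` rather than in the Python 2 `urlparse` module used above. The page URL comes from the question's `start_urls`; the three `src` values are invented for illustration.

```python
from urllib.parse import urljoin

page_url = "http://www.allen-heath.com/ahproducts/ilive-80/"

# Hypothetical src values: root-relative, path-relative, and already absolute.
srcs = [
    "/media/ilive-80.jpg",
    "images/front.png",
    "http://cdn.example.com/pic.jpg",
]

absolute = [urljoin(page_url, src) for src in srcs]
for url in absolute:
    print(url)
```

Because `urljoin` returns an already absolute URL unchanged, keeping the join in place should be harmless even when every `src` turns out to be absolute.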