Renaming downloaded images in Scrapy 0.24 with content from an item field while avoiding filename conflicts?
I'm trying to rename the images downloaded by my Scrapy 0.24 spider. Right now the downloaded images are stored with a SHA1 hash of their URLs as the filenames. I'd like to instead name them with the value I extract with item['model']. This question from 2011 outlines what I want, but the answers are for previous versions of Scrapy and don't work with the latest version.
Once I get this working, I'll also need to make sure I account for different images being downloaded with the same filename. So I need to download each image to its own uniquely named folder, presumably based on the original URL.
Here is a copy of the code I'm using in my pipeline. I got this code from a more recent answer in the link above, but it's not working for me. Nothing errors out and the images download fine; my extra code just doesn't seem to have any effect on the filenames, as they still appear as SHA1 hashes.
pipelines.py
class AllenheathPipeline(object):
    def process_item(self, item, spider):
        return item

import scrapy
from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.http import Request
from scrapy.exceptions import DropItem

class MyImagesPipeline(ImagesPipeline):
    # Name download version
    def file_path(self, request, response=None, info=None):
        item = request.meta['item']  # Like this you can use all from item, not just url.
        image_guid = request.url.split('/')[-1]
        return 'full/%s' % (image_guid)

    # Name thumbnail version
    def thumb_path(self, request, thumb_id, response=None, info=None):
        image_guid = thumb_id + request.url.split('/')[-1]
        return 'thumbs/%s/%s.jpg' % (thumb_id, image_guid)

    def get_media_requests(self, item, info):
        #yield Request(item['images'])  # Adding meta. I don't know how to put it in one line :-)
        for image in item['images']:
            yield Request(image)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item
settings.py
BOT_NAME = 'allenheath'
SPIDER_MODULES = ['allenheath.spiders']
NEWSPIDER_MODULE = 'allenheath.spiders'
ITEM_PIPELINES = {'scrapy.contrib.pipeline.images.ImagesPipeline': 1}
IMAGES_STORE = 'c:/allenheath/images'
products.py (my spider)
import scrapy
import urlparse
from allenheath.items import ProductItem
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

class productsSpider(scrapy.Spider):
    name = "products"
    allowed_domains = ["http://www.allen-heath.com/"]
    start_urls = [
        "http://www.allen-heath.com/ahproducts/ilive-80/",
        "http://www.allen-heath.com/ahproducts/ilive-112/"
    ]

    def parse(self, response):
        for sel in response.xpath('/html'):
            item = ProductItem()
            item['model'] = sel.css('#prodsingleouter > div > div > h2::text').extract()  # The value I'd like to use to name my images.
            item['shortdesc'] = sel.css('#prodsingleouter > div > div > h3::text').extract()
            item['desc'] = sel.css('#tab1 #productcontent').extract()
            item['series'] = sel.css('#pagestrip > div > div > a:nth-child(3)::text').extract()
            item['imageorig'] = sel.css('#prodsingleouter > div > div > h2::text').extract()
            item['image_urls'] = sel.css('#tab1 #productcontent .col-sm-9 img').xpath('./@src').extract()
            item['image_urls'] = [urlparse.urljoin(response.url, url) for url in item['image_urls']]
            yield item
items.py
import scrapy

class ProductItem(scrapy.Item):
    model = scrapy.Field()
    itemcode = scrapy.Field()
    shortdesc = scrapy.Field()
    desc = scrapy.Field()
    series = scrapy.Field()
    imageorig = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
Here's a pastebin of the output I get at the command prompt when I run the spider: http://pastebin.com/ir7YZFqf
Any help would be greatly appreciated!
pipelines.py:
from scrapy.pipelines.images import ImagesPipeline
from scrapy.http import Request
from scrapy.exceptions import DropItem
from scrapy import log

class MyImagesPipeline(ImagesPipeline):
    # Name download version
    def file_path(self, request, response=None, info=None):
        image_guid = request.meta['model'][0]
        log.msg(image_guid, level=log.DEBUG)
        return 'full/%s' % (image_guid)

    # Name thumbnail version
    def thumb_path(self, request, thumb_id, response=None, info=None):
        image_guid = thumb_id + request.url.split('/')[-1]
        log.msg(image_guid, level=log.DEBUG)
        return 'thumbs/%s/%s.jpg' % (thumb_id, image_guid)

    def get_media_requests(self, item, info):
        yield Request(item['image_urls'][0], meta=item)
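To cover the second part of the question (different images that share a filename), one option is to fold a short hash of the source URL into the returned path so each image lands in its own uniquely named directory. A minimal, self-contained sketch of such a helper; the function name and the 8-character digest length are illustrative choices, not part of the original answer:

```python
import hashlib

def unique_image_path(model, url):
    """Build a collision-safe storage path from an item's model value
    and the image's source URL. Hypothetical helper for illustration."""
    # A short digest of the URL keeps paths unique even when two
    # different images share the same basename or model value.
    url_hash = hashlib.sha1(url.encode('utf-8')).hexdigest()[:8]
    return 'full/%s/%s' % (url_hash, model)
```

Inside the pipeline, file_path could simply return unique_image_path(request.meta['model'][0], request.url).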
The settings.py you're using is wrong. You should use this:
ITEM_PIPELINES = {'allenheath.pipelines.MyImagesPipeline': 1}
To get thumbnails working, add this to settings.py:
IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (100, 100),
}
Since the URL hash already guarantees a unique identifier, you could also write the item's value and the URL hash out to a file on their own.
Once everything is done, you can loop over that file and do the renaming (using a Counter dictionary to make sure a number gets appended, based on how many items share the same value, when you rename them).
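A rough sketch of that post-crawl rename pass, assuming the (model, hash) pairs were collected into a list; the helper name, the .jpg extension, and the pairs format are assumptions for illustration:

```python
import os
from collections import Counter

def rename_images(pairs, store_dir):
    """Rename SHA1-named files to model-based names, appending a
    counter when several items share the same model value.
    `pairs` is a list of (model, sha1_hash) tuples."""
    seen = Counter()
    for model, image_hash in pairs:
        seen[model] += 1
        # First occurrence keeps the plain name; duplicates get -2, -3, ...
        suffix = '' if seen[model] == 1 else '-%d' % seen[model]
        src = os.path.join(store_dir, image_hash + '.jpg')
        dst = os.path.join(store_dir, '%s%s.jpg' % (model, suffix))
        if os.path.exists(src):
            os.rename(src, dst)
```

This runs entirely outside Scrapy, so it can't interfere with the download itself.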