Trouble downloading files using Scrapy
I am trying to extract data from a table showing the list of active bids on this site. I am new to Scrapy and a little confused about why no files are being downloaded. I can output the file URLs, but the files themselves are never downloaded from those URLs. I can't figure out what I'm missing or what needs to change. Any help would be greatly appreciated!
Thanks!
Here is the code I have so far.
This is my spider:
import scrapy
from government.items import GovernmentItem

class AlabamaSpider(scrapy.Spider):
    name = 'alabama'
    allowed_domains = ['purchasing.alabama.gov']

    def start_requests(self):
        url = 'https://purchasing.alabama.gov/active-statewide-contracts/'
        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for row in response.xpath('//*[@class="table table-bordered table-responsive-sm"]//tbody//tr'):
            yield {
                'Description': row.xpath('normalize-space(./td[@class="col-sm-5"])').extract_first(),
                'Bid File': row.xpath('td[@class="col-sm-1"]/a//@href').extract_first(),
                'Begin Date': row.xpath('normalize-space(./td[@class="col-sm-1"][2])').extract_first(),
                'End Date': row.xpath('normalize-space(./td[@class="col-sm-1"][3])').extract_first(),
                'Buyer Name': row.xpath('td[@class="col-sm-3"]/a//text()').extract_first(),
                'Vendor Websites': row.xpath('td[@class="col-sm-1"]/label/text()').extract_first(),
            }

    def parse_item(self, response):
        file_url = response.xpath('td[@class="col-sm-1"]/a//@href').get()
        # file_url = response.urljoin(file_url)
        item = GovernmentItem()
        item['file_urls'] = [file_url]
        yield item
Here is items.py:
from scrapy.item import Item, Field
import scrapy

class GovernmentItem(Item):
    file_urls = Field()
    files = Field()
Here is my settings.py:
BOT_NAME = 'government'

SPIDER_MODULES = ['government.spiders']
NEWSPIDER_MODULE = 'government.spiders'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure item pipelines
ITEM_PIPELINES = {
    'government.pipelines.GovernmentPipeline': 1,
    'scrapy.pipelines.files.FilesPipeline': 1,
}

FILES_STORE = '/home/ken/Desktop/Projects/scrapy/government'
FILES_URL_FIELD = 'field_urls'
FILES_RESULT_FIELD = 'files'
MEDIA_ALLOW_REDIRECTS = True
DOWNLOAD_DELAY = 1
Did you add

ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
FILES_STORE = '/path/to/valid/dir'

to settings.py?
There are a few problems with your code:
- You never call the parse_item function.
- file_url = response.xpath('td[@class="col-sm-1"]/a//@href').get()
  will return None: you forgot the leading '//'.
- You need to download each file separately, so fetch the download links with getall() and process them one by one.
Corrected code:

def parse_all_items(self, response):
    all_urls = response.xpath('//td[@class="col-sm-1"]/a//@href').getall()
    base_url = 'https://purchasing.alabama.gov'
    for url in all_urls:
        item = GovernmentItem()
        item['file_urls'] = [base_url + url]
        yield item

This will download all the files. Just make sure you remember to call the function.
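One way to make sure it actually runs is to delegate to it from parse() with yield from. Here is a minimal, Scrapy-free sketch of that wiring (plain generators stand in for the spider callbacks, and the '/contract-file/...' hrefs are hypothetical examples, not real paths from the site):

```python
# Sketch of the fix: parse() must hand control to parse_all_items(),
# otherwise the file items are never yielded to the pipeline.
# Plain generators stand in for the Scrapy callbacks here.

def parse_all_items(hrefs):
    base_url = 'https://purchasing.alabama.gov'
    for url in hrefs:
        # In the real spider this would be a GovernmentItem()
        yield {'file_urls': [base_url + url]}

def parse(hrefs):
    # ... yield the table-row dicts here, then delegate so the
    # file items reach the FilesPipeline too:
    yield from parse_all_items(hrefs)

items = list(parse(['/contract-file/1', '/contract-file/2']))
print(items[0]['file_urls'])
```

In the real spider the equivalent line inside parse() would be `yield from self.parse_all_items(response)`.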
Alternatively, use the parse function you already have:

def parse(self, response):
    base_url = 'https://purchasing.alabama.gov'
    for row in response.xpath('//*[@class="table table-bordered table-responsive-sm"]//tbody//tr'):
        url = row.xpath('td[@class="col-sm-1"]/a//@href').extract_first()
        yield {
            'Description': row.xpath('normalize-space(./td[@class="col-sm-5"])').extract_first(),
            'Bid File': row.xpath('td[@class="col-sm-1"]/a//@href').extract_first(),
            'Begin Date': row.xpath('normalize-space(./td[@class="col-sm-1"][2])').extract_first(),
            'End Date': row.xpath('normalize-space(./td[@class="col-sm-1"][3])').extract_first(),
            'Buyer Name': row.xpath('td[@class="col-sm-3"]/a//text()').extract_first(),
            'Vendor Websites': row.xpath('td[@class="col-sm-1"]/label/text()').extract_first(),
        }
        if url:
            item = GovernmentItem()
            item['file_urls'] = [base_url + url]
            yield item
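As an aside, rather than concatenating base_url + url by hand you can use response.urljoin(url), which wraps the standard library's urllib.parse.urljoin. It builds the absolute URL and also copes with hrefs that are already absolute. A small sketch (the '/contract-file/123' href is a made-up example for illustration):

```python
from urllib.parse import urljoin

page = 'https://purchasing.alabama.gov/active-statewide-contracts/'

# A root-relative href takes its scheme and host from the page URL:
print(urljoin(page, '/contract-file/123'))

# An already-absolute href is passed through unchanged:
print(urljoin(page, 'https://example.com/file.pdf'))
```

Inside a spider callback, `response.urljoin(url)` does the same thing with the response's own URL as the base.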