Scrapy spider not saving html files
I have a Scrapy spider that I generated. The purpose of the spider is to return network data for plotting a network graph, and also to save an html file for every page the spider reaches. The spider is achieving the first goal but not the second: it produces a csv file with the crawl information, but I cannot see that it is saving any html files.
# -*- coding: utf-8 -*-
from scrapy.selector import HtmlXPathSelector
from scrapy.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.utils.url import urljoin_rfc
from sitegraph.items import SitegraphItem


class CrawlSpider(CrawlSpider):
    name = "example"
    custom_settings = {
        'DEPTH_LIMIT': '1',
    }
    allowed_domains = []
    start_urls = (
        'http://exampleurl.com',
    )

    rules = (
        Rule(LinkExtractor(allow=r'/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        i = SitegraphItem()
        i['url'] = response.url
        # i['http_status'] = response.status
        llinks = []
        for anchor in hxs.select('//a[@href]'):
            href = anchor.select('@href').extract()[0]
            if not href.lower().startswith("javascript"):
                llinks.append(urljoin_rfc(response.url, href))
        i['linkedurls'] = llinks
        return i

    def parse(self, response):
        filename = response.url.split("/")[-1] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
The traceback I receive is as follows:
Traceback (most recent call last):
File "...\Anaconda3\lib\site-packages\scrapy\core\downloader\middleware.py", line 43, in process_request
defer.returnValue((yield download_func(request=request,spider=spider)))
twisted.internet.error.TCPTimedOutError: TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond..
2019-07-23 14:16:41 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://externalurl.com/> (failed 3 times): TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond..
2019-07-23 14:16:41 [scrapy.core.scraper] ERROR: Error downloading <GET http://externalurl.com/>
Traceback (most recent call last):
File "...\Anaconda3\lib\site-packages\scrapy\core\downloader\middleware.py", line 43, in process_request
defer.returnValue((yield download_func(request=request,spider=spider)))
twisted.internet.error.TCPTimedOutError: TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond..
2019-07-23 14:16:41 [scrapy.core.engine] INFO: Closing spider (finished)
2019-07-23 14:16:41 [scrapy.extensions.feedexport] INFO: Stored csv feed (153 items) in: exampledomainlevel1.csv
parse method:
According to the Scrapy docs, it is not recommended to override the parse method, because CrawlSpider uses it to implement its crawling logic.
If you need to override the parse method and at the same time keep the original CrawlSpider.parse behaviour, you need to add its original source code to your overridden parse method:
def parse(self, response):
    filename = response.url.split("/")[-1] + '.html'
    with open(filename, 'wb') as f:
        f.write(response.body)
    return self._parse_response(response, self.parse_start_url, cb_kwargs={}, follow=True)
csv feed:
This log line:
2019-07-23 14:16:41 [scrapy.extensions.feedexport] INFO: Stored csv feed (153 items) in: exampledomainlevel1.csv
means that the csv feedexporter is enabled (probably in the settings.py project settings file).
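For reference, a minimal sketch of how such a csv export is typically enabled in settings.py, assuming Scrapy 1.x-style feed settings (the filename is taken from the log line above; the field list is an assumption based on the item shown in the question):

# settings.py -- minimal sketch, assuming Scrapy 1.x style feed settings.
# In Scrapy 2.1+ the equivalent is the FEEDS dict setting.
FEED_FORMAT = 'csv'
FEED_URI = 'exampledomainlevel1.csv'

# Optional: which item fields become csv columns (defaults to all fields).
FEED_EXPORT_FIELDS = ['url', 'linkedurls']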
UPDATE
I took another look at the CrawlSpider source code.
It looks like the parse method is only called once, at the start of the crawl, and does not cover all web responses.
If my theory is correct, adding this function to your spider class should save all html responses:
def _response_downloaded(self, response):
    filename = response.url.split("/")[-1] + '.html'
    with open(filename, 'wb') as f:
        f.write(response.body)
    rule = self._rules[response.meta['rule']]
    return self._parse_response(response, rule.callback, rule.cb_kwargs, rule.follow)
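One caveat (my own observation, not part of the original answer): response.url.split("/")[-1] is empty for URLs that end with a slash, so such pages would all be written to ".html". A sketch of a safer, hypothetical filename helper, assuming you are happy to derive the name from the URL path plus a short hash:

import hashlib
from urllib.parse import urlparse

def url_to_filename(url):
    # Build a readable, filesystem-safe name from the URL path,
    # and append a short hash so different URLs do not collide.
    path = urlparse(url).path.strip('/').replace('/', '_') or 'index'
    digest = hashlib.md5(url.encode('utf-8')).hexdigest()[:8]
    return '%s_%s.html' % (path, digest)

In either of the snippets above you could then write filename = url_to_filename(response.url) instead of splitting the URL.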