Parse callback is not defined - Simple Webscraper (Scrapy) still not running
I've been googling for half a day and still can't solve the problem. Maybe you have some insight?

I'm trying to launch my scraper from a script rather than from the terminal. This worked fine without rules, simply yielding from the normal parse function. As soon as I use rules and change callback="parse" to callback="parse_item", nothing works anymore.

I also tried building the crawler around requests yielded from the parse function. The result: only the single start URL was scraped, but not the rest of the domain. Using rules seems to be the way out, so I would really like to get it running with rules rather than yielding from the parse function.
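For context, the yield-based approach I mean would look roughly like this (an illustrative sketch, not my exact code; it follows links by yielding new requests back to the scheduler):

import scrapy

class ManualSpider(scrapy.Spider):
    name = "manual"
    allowed_domains = ["www.bueffeln.net"]
    start_urls = ['https://www.bueffeln.net']

    def parse(self, response):
        # ... extract data from the current page here ...
        # Then follow every link by hand; without a loop like this,
        # only the start URL is ever crawled. Off-domain links are
        # filtered out by the OffsiteMiddleware.
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)

My current code with rules instead: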
import scrapy
from scrapy.crawler import CrawlerProcess
from bs4 import BeautifulSoup
from scrapy.http import Request
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

def beauty(response_dictionary):
    html_response = response_dictionary["html"]
    print(response_dictionary["url"])
    for html in html_response:
        soup = BeautifulSoup(html, 'lxml')
        metatag = soup.find_all("meta")
        print(metatag)

class MySpider(scrapy.Spider):
    name = "MySpidername"
    allowed_domains = ["www.bueffeln.net"]
    start_urls = ['https://www.bueffeln.net']
    rules = [Rule(LinkExtractor(allow=()), callback='parse_item', follow=True),]

    def parse_item(self, response):
        url_dictionary = {}
        print(response.status)
        url_dictionary["url"] = response.url
        print(response.headers)
        url_dictionary["html"] = response.xpath('//html').getall()
        beauty(url_dictionary)

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(MySpider)
process.start()
The error output looks like this:
2019-11-18 18:14:56 [scrapy.utils.log] INFO: Scrapy 1.7.4 started (bot: scrapybot)
2019-11-18 18:14:56 [scrapy.utils.log] INFO: Versions: lxml 4.4.1.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.7.0, Python 3.7.4 (tags/v3.7.4:e09359112e, Jul 8 2019, 19:29:22) [MSC v.1916 32 bit (Intel)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1d 10 Sep 2019), cryptography 2.8, Platform Windows-10-10.0.18362-SP0
2019-11-18 18:14:56 [scrapy.crawler] INFO: Overridden settings: {'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}
2019-11-18 18:14:56 [scrapy.extensions.telnet] INFO: Telnet Password: 970cca12e7c43d67
2019-11-18 18:14:56 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2019-11-18 18:14:57 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-11-18 18:14:57 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-11-18 18:14:57 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-11-18 18:14:57 [scrapy.core.engine] INFO: Spider opened
2019-11-18 18:14:57 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-11-18 18:14:57 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-11-18 18:14:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.bueffeln.net> (referer: None)
2019-11-18 18:14:57 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.bueffeln.net> (referer: None)
Traceback (most recent call last):
File "C:\Users\msi\PycharmProjects\test\venv\lib\site-packages\twisted\internet\defer.py", line 654, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "C:\Users\msi\PycharmProjects\test\venv\lib\site-packages\scrapy\spiders\__init__.py", line 80, in parse
raise NotImplementedError('{}.parse callback is not defined'.format(self.__class__.__name__))
NotImplementedError: MySpider.parse callback is not defined
2019-11-18 18:14:57 [scrapy.core.engine] INFO: Closing spider (finished)
2019-11-18 18:14:57 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 231,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 16695,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 0.435081,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 11, 18, 17, 14, 57, 454733),
'log_count/DEBUG': 1,
'log_count/ERROR': 1,
'log_count/INFO': 10,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'spider_exceptions/NotImplementedError': 1,
'start_time': datetime.datetime(2019, 11, 18, 17, 14, 57, 19652)}
2019-11-18 18:14:57 [scrapy.core.engine] INFO: Spider closed (finished)
Process finished with exit code 0
Scrapy uses the parse callback to parse the URLs from start_urls. You didn't provide such a callback, which is why Scrapy can't process your https://www.bueffeln.net URL.

If you want your code to work, you need to add a parse callback (even an empty one). Your rules will be applied after the parse callback.
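A minimal sketch of that workaround, keeping scrapy.Spider: this makes the NotImplementedError go away, although, as the update below shows, the rules themselves only take effect on a CrawlSpider.

import scrapy

class MySpider(scrapy.Spider):
    name = "MySpidername"
    start_urls = ['https://www.bueffeln.net']

    def parse(self, response):
        # Empty default callback: Scrapy no longer raises
        # NotImplementedError for responses from start_urls.
        pass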
Update

To use rules, you need CrawlSpider (which your code already imports from scrapy.spiders; note that CrawlSpider is not exposed at the top-level scrapy namespace):

class MySpider(CrawlSpider):
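Putting it all together, a corrected version of your script could look like this (a sketch based on your code; note that CrawlSpider uses parse internally to apply the rules, so the callback must keep a different name such as parse_item):

import scrapy
from scrapy.crawler import CrawlerProcess
from bs4 import BeautifulSoup
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

def beauty(response_dictionary):
    # Print the URL and all <meta> tags of a crawled page.
    print(response_dictionary["url"])
    for html in response_dictionary["html"]:
        soup = BeautifulSoup(html, 'lxml')
        print(soup.find_all("meta"))

class MySpider(CrawlSpider):  # CrawlSpider instead of scrapy.Spider
    name = "MySpidername"
    allowed_domains = ["www.bueffeln.net"]
    start_urls = ['https://www.bueffeln.net']
    # Follow every in-domain link and hand each response to parse_item.
    rules = [Rule(LinkExtractor(allow=()), callback='parse_item', follow=True)]

    def parse_item(self, response):
        url_dictionary = {
            "url": response.url,
            "html": response.xpath('//html').getall(),
        }
        beauty(url_dictionary)

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(MySpider)
process.start()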