Scrapy contracts 101
I want to try out Scrapy contracts as an alternative to a full-blown test suite.
Detailed steps to reproduce are below.
In the tmp directory
> cd /tmp
run
> source venv/bin/activate
to activate Python 3.8 and Scrapy 2.5.0. Then:
> scrapy startproject foo
New Scrapy project 'foo'...
> cd foo
> scrapy genspider bar example.com
Created spider 'bar' using template 'basic' in module:
foo.spiders.bar
Running
> scrapy check
at this point duly reports that all is well:
Ran 0 contracts in 0.000s
OK
Now edit /tmp/foo/foo/spiders/bar.py and replace:
def parse(self, response):
    pass
with:
def parse(self, response):
    """ This function parses a sample response. Some contracts are mingled
    with this docstring.

    @url http://toscrape.com/
    @returns items 1 16
    @returns requests 0 0
    @scrapes Title Author Year Price
    """
The test will fail. That's cool. Tweaking it so it passes will be a sequel (not this question). In this 101, we just want to get past errors.
Comment out:
# allowed_domains = ['example.com']
so that the contract URL (toscrape.com) is allowed through.
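For reference, bar.py at this point looks roughly as follows (reconstructed from Scrapy's default 'basic' template, so exact details may differ):

import scrapy


class BarSpider(scrapy.Spider):
    name = 'bar'
    # allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        """ This function parses a sample response. Some contracts are mingled
        with this docstring.

        @url http://toscrape.com/
        @returns items 1 16
        @returns requests 0 0
        @scrapes Title Author Year Price
        """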
At this point scrapy check says:
F..
======================================================================
FAIL: [bar] parse (@returns post-hook)
----------------------------------------------------------------------
Traceback (most recent call last):
File "scrapy/contracts/__init__.py", line 54, in wrapper
self.post_process(output)
File "scrapy/contracts/default.py", line 92, in post_process
raise ContractFail(f"Returned {occurrences} {self.obj_name}, expected {expected}")
scrapy.exceptions.ContractFail: Returned 0 items, expected 1..16
----------------------------------------------------------------------
Ran 3 contracts in 0.322s
FAILED (failures=1)
What am I missing?
With @url http://www.amazon.com/s?field-keywords=selfish+gene I also get error 503.
It is probably a very old example - it uses http but modern pages use https - and Amazon may have rebuilt the page; it now has better systems for detecting spammers/hackers/bots and blocking them.
If I use @url http://toscrape.com/ then I don't get error 503, but I still get the other error FAILED, because it needs some code in parse().
@scrapes Title Author Year Price
means it has to return items with the keys Title, Author, Year, Price:
item = {'Title': '', 'Author': '', 'Year': '', 'Price': ''}
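As a side note (my own sketch, not part of the original code): the same check also passes with a declared Item class instead of a plain dict, as long as each field is actually assigned a value so the key is present on the yielded item:

import scrapy


class BookItem(scrapy.Item):
    # declared fields; @scrapes looks for these keys on the yielded item
    Title = scrapy.Field()
    Author = scrapy.Field()
    Year = scrapy.Field()
    Price = scrapy.Field()


# inside parse():
#     yield BookItem(Title='', Author='', Year='', Price='')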
@returns items 1 16
means it has to return at least 1 item and at most 16 items.
At least 1 item:
yield item
At most 16 items:
for _ in range(16):
    yield item
@returns requests 0 0
means you can't use yield Request(absolute_url, ...) or yield response.follow(relative_url, ...).
So parse() needs code like this:
import scrapy
from scrapy.http import Request


class ExampleSpider(scrapy.Spider):
    name = 'example'
    #allowed_domains = ['toscrape.com']
    start_urls = ['http://toscrape.com/']

    def parse(self, response):
        """
        ### @url http://www.amazon.com/s?field-keywords=selfish+gene
        ### @url https://www.amazon.com/s?k=selfish+gene
        @url http://toscrape.com/
        @returns items 1 16
        @returns requests 0 0
        @scrapes Title Author Year Price
        """
        print('[parse] url:', response.url)

        # don't return requests
        #yield Request('http://toscrape.com/other_page')
        #yield response.follow('/other_page')

        # it has to return item with keys: Title Author Year Price
        item = {'Title': '', 'Author': '', 'Year': '', 'Price': ''}

        # it has to return at least 1 item, and at most 16 items
        for _ in range(1):
            yield item
and then I get the result:
[parse] url: http://toscrape.com/
...
----------------------------------------------------------------------
Ran 3 contracts in 0.670s
OK
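As a follow-up sketch of my own (not required by the question): to have @scrapes check real data instead of empty strings, you could point the contract at http://books.toscrape.com/ and fill the keys from the catalogue page. The selectors below are assumptions about that page's markup (article.product_pod entries with a title attribute and a price_color element) and may need adjusting; Author and Year are not shown on the listing, so they stay as placeholders:

import scrapy


class BooksSpider(scrapy.Spider):
    name = 'books'
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        """
        @url http://books.toscrape.com/
        @returns items 1 20
        @returns requests 0 0
        @scrapes Title Author Year Price
        """
        # the first catalogue page lists 20 books, one per <article class="product_pod">
        for book in response.css('article.product_pod'):
            yield {
                'Title': book.css('h3 a::attr(title)').get(),
                'Price': book.css('p.price_color::text').get(),
                # Author and Year are not on the listing page,
                # so keep empty placeholders to satisfy @scrapes
                'Author': '',
                'Year': '',
            }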