Scrapy contracts 101

I would like to try Scrapy contracts as an alternative to a full-blown test suite.

Below are detailed steps to reproduce.

In the /tmp directory

> cd /tmp

I run

> source venv/bin/activate

to activate Python 3.8 and Scrapy 2.5.0.
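In case such a venv doesn't exist yet, here is a minimal sketch of creating one (assuming python3.8 is on the PATH):

> python3.8 -m venv venv
> source venv/bin/activate
> pip install Scrapy==2.5.0

Then: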

> scrapy startproject foo
New Scrapy project 'foo'...
> cd foo
> scrapy genspider bar example.com
Created spider 'bar' using template 'basic' in module:
  foo.spiders.bar
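For reference, the generated /tmp/foo/foo/spiders/bar.py from the 'basic' template should look roughly like this (minor details may vary between Scrapy versions):

import scrapy


class BarSpider(scrapy.Spider):
    name = 'bar'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        pass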

Then running

> scrapy check

truthfully says that everything is fine (contracts live in the docstrings of spider callbacks, and the freshly generated parse() has no docstring, so there is nothing to check):

Ran 0 contracts in 0.000s
OK

Now edit /tmp/foo/foo/spiders/bar.py and replace:

def parse(self, response):
    pass

with

def parse(self, response):
    """ This function parses a sample response. Some contracts are mingled
    with this docstring.

    @url http://toscrape.com/
    @returns items 1 16
    @returns requests 0 0
    @scrapes Title Author Year Price
    """

The test fails. That's cool. Tweaking it so that it passes is the sequel (not this question). In this 101, we just want to avoid errors.

Comment out:

# allowed_domains = ['example.com']

to let the contract URL (toscrape.com) through.

At this point scrapy check says:

F..
======================================================================
FAIL: [bar] parse (@returns post-hook)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "scrapy/contracts/__init__.py", line 54, in wrapper
    self.post_process(output)
  File "scrapy/contracts/default.py", line 92, in post_process
    raise ContractFail(f"Returned {occurrences} {self.obj_name}, expected {expected}")
scrapy.exceptions.ContractFail: Returned 0 items, expected 1..16

----------------------------------------------------------------------
Ran 3 contracts in 0.322s

FAILED (failures=1)

What am I missing?

With @url http://www.amazon.com/s?field-keywords=selfish+gene I also get error 503.

It is probably a very old example - it uses http but modern pages use https - and Amazon may have rebuilt its pages; it now has better systems to detect spammers/hackers/bots and block them.

If I use @url http://toscrape.com/ then I don't get error 503, but I still get the other error, FAILED, because it needs some code in parse().

@scrapes Title Author Year Price means it has to return items with the keys Title, Author, Year, Price:

    item = {'Title': '', 'Author': '', 'Year': '', 'Price': ''}

@returns items 1 16 means it has to return at least 1 item and at most 16 items.

At least 1 item:

yield item

At most 16 items:

for _ in range(16):
    yield item

@returns requests 0 0 means you can't use yield Request(absolute_url, ...) or yield response.follow(relative_url, ...).
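For example, a parse() like this would fail that contract - a minimal sketch, where the spider name and the extra page path are made up just for illustration:

import scrapy

class FailingSpider(scrapy.Spider):
    name = 'failing'
    start_urls = ['http://toscrape.com/']

    def parse(self, response):
        """
        @url http://toscrape.com/
        @returns requests 0 0
        """
        # yielding any Request violates "@returns requests 0 0"
        yield scrapy.Request('http://toscrape.com/other_page')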

So parse() needs code like this:

import scrapy
from scrapy.http import Request

class ExampleSpider(scrapy.Spider):
    name = 'example'
    #allowed_domains = ['toscrape.com']
    start_urls = ['http://toscrape.com/']

    def parse(self, response):
        """
        ### @url http://www.amazon.com/s?field-keywords=selfish+gene
        ### @url https://www.amazon.com/s?k=selfish+gene
        @url http://toscrape.com/
        @returns items 1 16
        @returns requests 0 0
        @scrapes Title Author Year Price
        """

        print('[parse] url:', response.url)

        # don't return requests
        #yield Request('http://toscrape.com/other_page')
        #yield response.follow('/other_page')
        
        # it has to return item with keys: Title Author Year Price
        item = {'Title': '', 'Author': '', 'Year': '', 'Price': ''}

        # it has to return at least 1 item, and at most 16 items
        for _ in range(1):  
            yield item

Then I get the result:

[parse] url: http://toscrape.com/
...
----------------------------------------------------------------------
Ran 3 contracts in 0.670s

OK
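As a side note, scrapy check can also run the contracts for a single spider if you pass its name:

> scrapy check example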