Scrapy: how to yield a list concatenated with append()

Question

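I am trying to yield or return a list built with append(), but I am getting errors.

Is there a way to fix this? I have noted the errors I get in the code comments.

Sorry, I'm new to Python.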

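from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# myItem and anotherItem are assumed to be scrapy.Item subclasses
# defined elsewhere in the project.

class mySpider(CrawlSpider):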
    name = "testspider"
    allowed_domains = ["example.com"]
    start_urls = (
        'http://www.example.com/test-page.html',
    )

    rules = (
        Rule(LinkExtractor(allow=('')), callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        item = myItem()

        #Extract some items
        item['status'] = response.status
        yield item

        inlinks = []
        links = LinkExtractor(canonicalize=False, unique=True).extract_links(response)
        for link in links:
            is_allowed = False
            for allowed_domain in self.allowed_domains:
                if allowed_domain in link.url:
                    is_allowed = True
            if is_allowed:
                inlink = anotherItem()
                inlink['url_from'] = response.url
                inlink['url_to'] = link.url
                inlinks.append(inlink)
        yield inlinks #ERROR: Spider must return Request, BaseItem, dict or None, got 'list' in <GET http://www.example.com/test-page.html>
        #if using yield inlink of course I get just the first element, in my case I get only the first URL for every unique page 
        #using return inlinks I get

Answer 1

Yield one item at a time. There is no need to build a list to return at the end.

    for link in links:
        is_allowed = False
        for allowed_domain in self.allowed_domains:
            if allowed_domain in link.url:
                is_allowed = True
        if is_allowed:
            inlink = anotherItem()
            inlink['url_from'] = response.url
            inlink['url_to'] = link.url
            yield inlink
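
To see why yielding inside the loop returns every link and not just the first one, here is a minimal plain-Python sketch of the same pattern. It does not use Scrapy: `parse_links`, the sample URLs, and the use of plain dicts in place of `anotherItem` are all assumptions made for illustration.

```python
# Plain-Python sketch of the pattern above; no Scrapy required.
# `parse_links` and the sample URLs are hypothetical stand-ins.

def parse_links(page_url, link_urls, allowed_domains):
    """Yield one dict per allowed link instead of building a list."""
    for url in link_urls:
        # Same substring check as in the question's code.
        if any(domain in url for domain in allowed_domains):
            # yield sits inside the loop, so it runs once per allowed link.
            yield {"url_from": page_url, "url_to": url}


items = list(parse_links(
    "http://www.example.com/test-page.html",
    [
        "http://www.example.com/a.html",
        "http://other.org/b.html",
        "http://www.example.com/c.html",
    ],
    ["example.com"],
))
print(len(items))
```

A consumer that iterates over the generator, as Scrapy does with a callback, receives every yielded item in turn, so no intermediate list is needed.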

Answer 2

The error message is clear: a Spider must return Request, BaseItem, dict or None.

But you are returning a list (what PHP calls an array).

    for link in links:
        is_allowed = False
        for allowed_domain in self.allowed_domains:
            if allowed_domain in link.url:
                is_allowed = True
        if is_allowed:
            inlink = anotherItem()
            inlink['url_from'] = response.url
            inlink['url_to'] = link.url
            yield inlink

You can use this code to avoid the error; it yields only one item at a time.

Or, if you really want to return/yield all the items at once, you can do this:

    for link in links:
        is_allowed = False
        for allowed_domain in self.allowed_domains:
            if allowed_domain in link.url:
                is_allowed = True
        if is_allowed:
            inlink = anotherItem()
            inlink['url_from'] = response.url
            inlink['url_to'] = link.url
            inlinks.append(inlink)
    yield {'all_links': inlinks}
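
As a rough illustration of what this second variant produces, here is a plain-Python sketch without Scrapy. The function name `parse_links_as_one_item`, the sample URLs, and the plain dicts standing in for `anotherItem` are assumptions, not part of the original code.

```python
# Plain-Python sketch: collect every link, then yield ONE dict wrapping the list.
# `parse_links_as_one_item` and the sample URLs are hypothetical stand-ins.

def parse_links_as_one_item(page_url, link_urls, allowed_domains):
    inlinks = []
    for url in link_urls:
        # Same substring check as in the question's code.
        if any(domain in url for domain in allowed_domains):
            inlinks.append({"url_from": page_url, "url_to": url})
    # A dict is an accepted return type, so the list travels inside it.
    yield {"all_links": inlinks}


results = list(parse_links_as_one_item(
    "http://www.example.com/test-page.html",
    ["http://www.example.com/a.html", "http://other.org/b.html"],
    ["example.com"],
))
print(len(results), len(results[0]["all_links"]))
```

The trade-off is that each page now produces a single item holding a nested list, so anything consuming the output sees one record per page, not one record per link.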

python

scrapy

python-2.7

scrapy-spider