Scrapy: how to yield a list built with append()

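I'm trying to yield or return a list that I build with append(), but I'm getting errors.

Is there a way around this? I've noted the errors I ran into in the comments.

Sorry, I'm new to Python.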

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class mySpider(CrawlSpider):
    name = "testspider"
    allowed_domains = ["example.com"]
    start_urls = (
        'http://www.example.com/test-page.html',
    )

    rules = (
        Rule(LinkExtractor(allow=('')), callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        item = myItem()

        #Extract some items
        item['status'] = response.status
        yield item

        inlinks = []
        links = LinkExtractor(canonicalize=False, unique=True).extract_links(response)
        for link in links:
            is_allowed = False
            for allowed_domain in self.allowed_domains:
                if allowed_domain in link.url:
                    is_allowed = True
            if is_allowed:
                inlink = anotherItem()
                inlink['url_from'] = response.url
                inlink['url_to'] = link.url
                inlinks.append(inlink)
        yield inlinks  # ERROR: Spider must return Request, BaseItem, dict or None, got 'list' in <GET http://www.example.com/test-page.html>
        # if I use "yield inlink" instead, I get just the first element; in my case only the first URL for every unique page
        #using return inlinks I get 

Yield one item at a time. There's no need to build a list to return at the end.

    for link in links:
        is_allowed = False
        for allowed_domain in self.allowed_domains:
            if allowed_domain in link.url:
                is_allowed = True
        if is_allowed:
            inlink = anotherItem()
            inlink['url_from'] = response.url
            inlink['url_to'] = link.url
            yield inlink
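
To try the idea without a running crawl, here is a minimal sketch in plain Python. The function name `iter_inlinks` and the plain dicts standing in for `anotherItem` are assumptions made for illustration, not part of the original spider. It shows that a function containing `yield` is a generator, so each matching link comes out as its own item.

```python
def iter_inlinks(page_url, link_urls, allowed_domains):
    """Yield one dict per link whose URL contains an allowed domain."""
    for url in link_urls:
        # Same loose substring check as in the spider code above.
        if any(domain in url for domain in allowed_domains):
            # A plain dict stands in for anotherItem (assumption).
            yield {"url_from": page_url, "url_to": url}


links = [
    "http://www.example.com/a.html",
    "http://other.org/b.html",
    "http://www.example.com/c.html",
]
results = list(
    iter_inlinks("http://www.example.com/test-page.html", links, ["example.com"])
)
print(len(results))  # 2
```

Every allowed link is produced, not just the first one, because the `yield` sits inside the loop.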

The error message is clear: the spider must return a Request, BaseItem, dict or None.

But you are returning a list (what PHP calls an array).

    for link in links:
        is_allowed = False
        for allowed_domain in self.allowed_domains:
            if allowed_domain in link.url:
                is_allowed = True
        if is_allowed:
            inlink = anotherItem()
            inlink['url_from'] = response.url
            inlink['url_to'] = link.url
            yield inlink 

You can use this code to avoid the error; it yields only one item at a time.
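
If the list has already been built, a plain-Python sketch like the following may help show the difference. The two function names are hypothetical, and plain dicts stand in for the items. `yield inlinks` hands the caller one object of type list, whereas `yield from inlinks` (Python 3.3+) hands over each element in turn.

```python
def yields_whole_list():
    inlinks = [{"url_to": "a"}, {"url_to": "b"}]
    yield inlinks  # the caller receives one object: the list itself


def yields_each_element():
    inlinks = [{"url_to": "a"}, {"url_to": "b"}]
    yield from inlinks  # the caller receives each dict in turn


whole = [type(x).__name__ for x in yields_whole_list()]
each = [type(x).__name__ for x in yields_each_element()]
print(whole)  # ['list']
print(each)   # ['dict', 'dict']
```

On Python 2, where `yield from` is not available, a `for inlink in inlinks: yield inlink` loop does the same job.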

Or, if you do want to return/yield all the items at once, you can do this:

    inlinks = []
    for link in links:
        is_allowed = False
        for allowed_domain in self.allowed_domains:
            if allowed_domain in link.url:
                is_allowed = True
        if is_allowed:
            inlink = anotherItem()
            inlink['url_from'] = response.url
            inlink['url_to'] = link.url
            inlinks.append(inlink)
    yield {'all_links': inlinks}
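
As a rough illustration of what that produces, here is a plain-Python sketch. `collect_inlinks` is a hypothetical name, and plain dicts are used in place of `anotherItem`. The callback emits exactly one dict per page, with the whole list stored under the `all_links` key.

```python
def collect_inlinks(page_url, link_urls, allowed_domains):
    """Yield a single dict that holds every allowed link for the page."""
    inlinks = []
    for url in link_urls:
        if any(domain in url for domain in allowed_domains):
            inlinks.append({"url_from": page_url, "url_to": url})
    # One dict wraps the whole list, so only one object is yielded.
    yield {"all_links": inlinks}


page = "http://www.example.com/test-page.html"
links = ["http://www.example.com/a.html", "http://other.org/b.html"]
output = list(collect_inlinks(page, links, ["example.com"]))
print(len(output))                  # 1
print(len(output[0]["all_links"]))  # 1
```

The trade-off is that downstream code gets one record per page rather than one record per link.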