Scrapy: how to yield a list concatenated with append()
I'm trying to yield or return a list built with append(), but I'm running into some errors.
Is there a solution? I've added comments in the code below showing the errors I get.
Sorry, I'm new to Python.
class mySpider(CrawlSpider):
    name = "testspider"
    allowed_domains = ["example.com"]
    start_urls = (
        'http://www.example.com/test-page.html',
    )
    rules = (
        Rule(LinkExtractor(allow=('')), callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        item = myItem()
        # Extract some items
        item['status'] = response.status
        yield item
        inlinks = []
        links = LinkExtractor(canonicalize=False, unique=True).extract_links(response)
        for link in links:
            is_allowed = False
            for allowed_domain in self.allowed_domains:
                if allowed_domain in link.url:
                    is_allowed = True
            if is_allowed:
                inlink = anotherItem()
                inlink['url_from'] = response.url
                inlink['url_to'] = link.url
                inlinks.append(inlink)
        yield inlinks  # ERROR: Spider must return Request, BaseItem, dict or None, got 'list' in <GET http://www.example.com/test-page.html>
        # if using yield inlink of course I get just the first element, in my case I get only the first URL for every unique page
        # using return inlinks I get
Yield one item at a time. There is no need to build up a list to return at the end:
for link in links:
    is_allowed = False
    for allowed_domain in self.allowed_domains:
        if allowed_domain in link.url:
            is_allowed = True
    if is_allowed:
        inlink = anotherItem()
        inlink['url_from'] = response.url
        inlink['url_to'] = link.url
        yield inlink
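As a side note, the nested domain check above can be condensed with Python's built-in any(). A minimal sketch (using plain URL strings in place of Scrapy's Link objects, so it runs without Scrapy):

```python
def filter_links(links, allowed_domains, url_from):
    """Yield one dict per link whose URL contains an allowed domain."""
    for link in links:
        # any() replaces the manual is_allowed flag from the loop above
        if any(domain in link for domain in allowed_domains):
            yield {'url_from': url_from, 'url_to': link}

links = ['http://www.example.com/a', 'http://other.org/b']
for item in filter_links(links, ['example.com'], 'http://www.example.com/'):
    print(item)
```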
The error message is clear: a spider must return a Request, BaseItem, dict, or None, but you are returning a list (what PHP calls an array).
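The difference is easy to see in plain Python, without Scrapy at all. A generator that yields the whole list produces a single list object, while yielding inside the loop produces one dict per link, which is what Scrapy expects from a callback:

```python
def yield_whole_list(links):
    """Emits a single value: the entire list (this is what triggers the error)."""
    inlinks = []
    for link in links:
        inlinks.append({'url_to': link})
    yield inlinks

def yield_per_item(links):
    """Emits one dict per link -- each one is a valid Scrapy item."""
    for link in links:
        yield {'url_to': link}

links = ['http://example.com/a', 'http://example.com/b']
print([type(x).__name__ for x in yield_whole_list(links)])  # ['list']
print([type(x).__name__ for x in yield_per_item(links)])    # ['dict', 'dict']
```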
for link in links:
    is_allowed = False
    for allowed_domain in self.allowed_domains:
        if allowed_domain in link.url:
            is_allowed = True
    if is_allowed:
        inlink = anotherItem()
        inlink['url_from'] = response.url
        inlink['url_to'] = link.url
        yield inlink
You can use this code to avoid the error, yielding just one item at a time.
Alternatively, if you want to return/yield all the items at once, you can do this:
inlinks = []
for link in links:
    is_allowed = False
    for allowed_domain in self.allowed_domains:
        if allowed_domain in link.url:
            is_allowed = True
    if is_allowed:
        inlink = anotherItem()
        inlink['url_from'] = response.url
        inlink['url_to'] = link.url
        inlinks.append(inlink)
yield {'all_links': inlinks}