How to determine if the generator returned from `yield scrapy.Request` has any data?
In the Scrapy Tutorial, the spider extracts next-page links from the element with class="next" and crawls them -
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
In my case, I cannot find a next-page link in the files downloaded from the web server, but I know the URL format is response.url concatenated with /page/[page number]/. Pages that yield no quotes still return a response, e.g. - No quotes found!. Since the number of next pages is usually fewer than 20, I can traverse all the possible urls by replacing the spider's last 3 lines with -
for page_num in range(2, 20):
    yield response.follow(f"/page/{page_num}/", callback=self.parse)
However, this forces the spider to request pages (e.g. http://quotes.toscrape.com/page/11 to 20) which don't yield quotes. How can I adjust my spider to terminate the page_num loop after requesting the first page which does not yield quotes (such as http://quotes.toscrape.com/page/11)?
Pseudocode -
page_num = 2
while (quotes are yielded from the response):
    yield response.follow(f"/page/{page_num}/", callback=self.parse)
    page_num += 1
You can use the result of response.css('..') as the condition for requesting the next page.
In that case your code would look like this:
import re

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    @staticmethod
    def get_pagenumber_from_url(url):
        # Helper to pull the page number out of a URL like
        # http://quotes.toscrape.com/page/7/ (defaults to 1 if absent).
        match = re.search(r'/page/(\d+)', url)
        return int(match.group(1)) if match else 1

    def parse(self, response):
        page_num = self.get_pagenumber_from_url(response.url)
        quotes_sel = response.css('div.quote')
        # quotes_sel is a SelectorList: non-empty (truthy) if the page
        # has item data, empty (falsy) if it doesn't.
        for quote in quotes_sel:
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
        if quotes_sel:
            next_page_url = f"/page/{page_num + 1}/"
            yield response.follow(next_page_url, callback=self.parse)