Need help with YellowPages spider
I'm new to Scrapy and have managed to build a few spiders so far. I want to write a spider that crawls Yellow Pages looking for websites that return a 404 response. The spider itself works fine, but pagination does not. Any help is appreciated. Thanks in advance.
# -*- coding: utf-8 -*-
import scrapy


class SpiderSpider(scrapy.Spider):
    name = 'spider'
    # allowed_domains = ['www.yellowpages.com']
    start_urls = ['https://www.yellowpages.com/search?search_terms=handyman&geo_location_terms=Miami%2C+FL']

    def parse(self, response):
        for listing in response.css('div.search-results.organic div.srp-listing'):
            url = listing.css('a.track-visit-website::attr(href)').extract_first()
            yield scrapy.Request(url=url, callback=self.parse_details)

        # follow pagination links
        next_page_url = response.css('a.next.ajax-page::attr(href)').extract_first()
        next_page_url = response.urljoin(next_page_url)
        if next_page_url:
            yield scrapy.Request(url=next_page_url, callback=self.parse)

    def parse_details(self, response):
        yield {'Response': response}
I ran your code and found an error. In the first loop you never check the value of url, and for listings without a website link it is None. Yielding scrapy.Request(url=None) raises an exception, which aborts the parse callback before the pagination request at the bottom is ever yielded; that is why pagination appears not to work.
Here is working code:
# -*- coding: utf-8 -*-
import scrapy


class SpiderSpider(scrapy.Spider):
    name = 'spider'
    # allowed_domains = ['www.yellowpages.com']
    start_urls = ['https://www.yellowpages.com/search?search_terms=handyman&geo_location_terms=Miami%2C+FL']

    def parse(self, response):
        for listing in response.css('div.search-results.organic div.srp-listing'):
            url = listing.css('a.track-visit-website::attr(href)').extract_first()
            # Some listings have no website link, so url can be None;
            # skip those instead of yielding an invalid request.
            if url:
                yield scrapy.Request(url=url, callback=self.parse_details)

        # follow pagination links
        next_page_url = response.css('a.next.ajax-page::attr(href)').extract_first()
        if next_page_url:
            next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse)

    def parse_details(self, response):
        yield {'Response': response}
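
A side note on the actual goal: yielding the raw Response object as an item is rarely useful. Since you want the listings whose websites return 404, a minimal sketch along these lines might work. The spider name and item fields here are illustrative; handle_httpstatus_list is the standard Scrapy way to let non-2xx responses reach your callback, since they are filtered out by default:

# -*- coding: utf-8 -*-
import scrapy


class BrokenSiteSpider(scrapy.Spider):
    name = 'broken_sites'  # hypothetical name for this sketch
    # Scrapy drops non-2xx responses before they reach callbacks;
    # this lets 404s through so they can be recorded.
    handle_httpstatus_list = [404]
    start_urls = ['https://www.yellowpages.com/search?search_terms=handyman&geo_location_terms=Miami%2C+FL']

    def parse(self, response):
        for listing in response.css('div.search-results.organic div.srp-listing'):
            url = listing.css('a.track-visit-website::attr(href)').extract_first()
            if url:
                yield scrapy.Request(url=url, callback=self.parse_details)

        next_page_url = response.css('a.next.ajax-page::attr(href)').extract_first()
        if next_page_url:
            yield scrapy.Request(url=response.urljoin(next_page_url), callback=self.parse)

    def parse_details(self, response):
        # Only emit an item for the broken (404) websites.
        if response.status == 404:
            yield {'url': response.url, 'status': response.status}

You could then collect the broken URLs into a file with something like scrapy crawl broken_sites -o broken.json.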