Scrapy not scraping all pages
I'm new to Scrapy and have been trying to build a spider that scrapes TripAdvisor's recommended things-to-do pages. TripAdvisor paginates its results with an offset, so I have the spider find the number of the last page, multiply it by the number of results per page, and loop over that range in steps of 30. However, it only returns a fraction of the results it should, and get_details only prints out 7 of the 28 pages. I believe what is happening is that the URLs of random pages are being redirected.
Scrapy logs this 301 redirect for the other pages, and it appears to be redirecting back to the first page. I tried disabling redirects, but that did not help.
2021-03-28 18:46:38 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.tripadvisor.com/Attractions-g55229-Activities-a_allAttractions.true-Nashville_Davidson_County_Tennessee.html> from <GET https://www.tripadvisor.com/Attractions-g55229-Activities-a_allAttractions.true-oa90-Nashville_Davidson_County_Tennessee.html>
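To make the offset scheme concrete, here is a small sketch (not part of the spider) of the URLs the loop generates; the figure of 28 pages is the one mentioned above, and the only thing it demonstrates is the oa{offset} substitution:

# sketch of the oa{offset} pagination: offset 0 is page 1, offset 30 is page 2, and so on
base = ('https://www.tripadvisor.com/Attractions-g55229-Activities-'
        'a_allAttractions.true-oa{}-Nashville_Davidson_County_Tennessee.html')
num_pages = 28  # last page number read from the first page
for offset in range(0, num_pages * 30, 30):
    page = offset // 30 + 1  # oa0 -> page 1, oa30 -> page 2, oa90 -> page 4, ...
    print(page, base.format(offset))

The 301 in the log above is one of those offset URLs (oa90, i.e. page 4) being bounced back to the offset-less first-page URL, which is why only some of the pages ever reach get_details.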
Here is my spider code:
import scrapy
import re


class TripadvisorSpider(scrapy.Spider):
    name = "tripadvisor"

    start_urls = [
        'https://www.tripadvisor.com/Attractions-g55229-Activities-a_allAttractions.true-oa{}-Nashville_Davidson_County_Tennessee.html'
    ]

    def parse(self, response):
        num_pages = int(response.css(
            '._37Nr884k .DrjyGw-P.IT-ONkaj::text')[-1].get())
        for offset in range(0, num_pages * 30, 30):
            formatted_url = self.start_urls[0].format(offset)
            yield scrapy.Request(formatted_url, callback=self.get_details)

    def get_details(self, response):
        print('url is ' + response.url)
        for listing in response.css('div._19L437XW._1qhi5DVB.CO7bjfl5'):
            yield {
                'title': listing.css('._392swiRT ._1gpq3zsA._1zP41Z7X::text')[1].get(),
                'category': listing.css('._392swiRT ._1fV2VpKV .DrjyGw-P._26S7gyB4._3SccQt-T::text').get(),
                'rating': float(re.findall(r"[-+]?\d*\.\d+|\d+", listing.css('svg.zWXXYhVR::attr(title)').get())[0]),
                'rating_count': float(listing.css('._392swiRT .DrjyGw-P._26S7gyB4._14_buatE._1dimhEoy::text').get().replace(',', '')),
                'url': listing.css('._3W_31Rvp._1nUIPWja._17LAEUXp._2b3s5IMB a::attr(href)').get(),
                'main_image': listing.css('._1BR0J4XD').attrib['src']
            }
Is there a way to get Scrapy to work for every page? What exactly is causing this problem?
Found a solution. It turned out I needed to handle the redirects manually and disable Scrapy's default middleware.
Here is the custom middleware I added to middlewares.py:
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.selector import Selector
from scrapy.utils.response import get_meta_refresh


class CustomRetryMiddleware(RetryMiddleware):

    def process_response(self, request, response, spider):
        url = response.url
        if response.status in [301, 307]:
            reason = 'redirect %d' % response.status
            return self._retry(request, reason, spider) or response
        interval, redirect_url = get_meta_refresh(response)
        # handle meta redirect
        if redirect_url:
            reason = 'meta'
            return self._retry(request, reason, spider) or response
        hxs = Selector(response)
        # test for captcha page
        captcha = hxs.xpath(
            ".//input[contains(@id, 'captchacharacters')]").extract()
        if captcha:
            reason = 'captcha'
            return self._retry(request, reason, spider) or response
        return response
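For the middleware to take effect it also has to be enabled in settings.py, with Scrapy's built-in redirect and retry middleware switched off so the 301 responses actually reach process_response. The sketch below assumes the project package is named tripadvisor (adjust the module path to your own project); priority 550 is the slot the built-in RetryMiddleware normally occupies:

# settings.py -- sketch, assuming the project package is named "tripadvisor"
DOWNLOADER_MIDDLEWARES = {
    # turn off the built-in handling so 301/307 responses are not followed or retried twice
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': None,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    # plug in the custom middleware from middlewares.py at the retry middleware's usual priority
    'tripadvisor.middlewares.CustomRetryMiddleware': 550,
}

RETRY_TIMES = 5  # how many times a request is re-queued before giving up

Note that _retry() returns None once RETRY_TIMES is exhausted, which is why every branch in the middleware falls back to "or response" and hands the page to the spider instead of dropping it silently.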
It is an updated version of the top answer to this question: Scrapy retry or redirect middleware