Why is the same URL scraped twice instead of my two different start_urls?
I have the following spider:
from scrapy import Spider


class SpiderOpTest(Spider):
    name = "test"

    start_urls = [
        "https://www.oddsportal.com/tennis/argentina/atp-buenos-aires/results/#/page/2/",
        "https://www.oddsportal.com/tennis/argentina/atp-buenos-aires-2012/results/#/page/2/",
    ]

    custom_settings = {
        "USER_AGENT": "*",
        "LOG_LEVEL": "WARNING",
        "DOWNLOADER_MIDDLEWARES": {'scraper_scrapy.odds.middlewares.SeleniumMiddleware': 543},
    }

    httperror_allowed_codes = [301]

    def parse(self, response):
        print(f"Parsing tournament page - {response.url}")
When I run this, the print output shows that the first URL in start_urls has been scraped twice. Why is this happening?
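One quick way to check whether both requests are actually being made is to print the URL Scrapy scheduled next to the URL the response reports. A minimal sketch of the same parse method (response.request is the Request the response was built from):

def parse(self, response):
    # response.url comes from the response object itself, while
    # response.request.url is the URL Scrapy originally scheduled.
    print(f"Parsing tournament page - {response.url} (requested: {response.request.url})")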
Since key parts of the page are loaded via JavaScript, the Selenium middleware I'm using is probably relevant:
from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver


class SeleniumMiddleware:
    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        crawler.signals.connect(middleware.spider_opened, signals.spider_opened)
        crawler.signals.connect(middleware.spider_closed, signals.spider_closed)
        return middleware

    def process_request(self, request, spider):
        self.driver.get(request.url)
        return HtmlResponse(
            self.driver.current_url,
            body=self.driver.page_source,
            encoding='utf-8',
            request=request,
        )

    def spider_opened(self, spider):
        options = webdriver.FirefoxOptions()
        options.add_argument("--headless")
        self.driver = webdriver.Firefox(options=options)

    def spider_closed(self, spider):
        self.driver.close()
Your problem seems to come from this line: self.driver.current_url. The driver's URL gets set to the first URL and is never updated, so every response is built with that same URL. I think you should use request.url on that line instead:
def process_request(self, request, spider):
    self.driver.get(request.url)
    return HtmlResponse(
        request.url,
        body=self.driver.page_source,
        encoding='utf-8',
        request=request,
    )
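If that fixes it, you can confirm where the mismatch came from by printing both values side by side before the response is built. A minimal sketch of the same process_request (assuming the driver set up in spider_opened above; the print is only for diagnosis):

def process_request(self, request, spider):
    self.driver.get(request.url)
    # request.url is the URL Scrapy scheduled; driver.current_url is whatever
    # the browser (and the page's JavaScript router) currently reports.
    print(f"requested: {request.url} | browser reports: {self.driver.current_url}")
    return HtmlResponse(
        request.url,  # use the scheduled URL so parse() sees two distinct pages
        body=self.driver.page_source,
        encoding='utf-8',
        request=request,
    )

If the browser reports the same URL for both requests, every response built from self.driver.current_url carries that one URL, which matches the duplicate output you saw.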