How to exclude URLs that have already been scraped when crawling with the Scrapy framework
I am crawling a news website, extracting the news data and dumping it into MongoDB.
My spider is defined with the following rules:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

rules = [Rule(
    LinkExtractor(
        allow=["foo.tv/en/*",
               "https://fooports.tv/*"]  # only such urls
    ))]
What I currently do is fetch the already-scraped URLs from the database and skip processing a URL if it is found there, for example:
urls_visited = get_visited_urls()  # Fetches the already-scraped URLs from MongoDB
if response.url not in urls_visited:
    # do scraping here
    ...
What I am looking for is a way to make the spider skip URLs that have already been scraped, so that crawl time is reduced by not revisiting pages that were already processed. I know the rules have a deny option, but I am not sure how to use it in this case.
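For context, deny accepts regular expressions (like allow), so in principle the already-visited URLs could be escaped and passed there. A rough sketch, reusing the get_visited_urls() helper shown above; note that the list is built once when the spider class is defined and grows with the database:

import re

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

# Assumed helper from above; returns the URLs already stored in MongoDB.
urls_visited = get_visited_urls()

rules = [Rule(
    LinkExtractor(
        allow=["foo.tv/en/*", "https://fooports.tv/*"],
        # deny takes regexes, so literal URLs must be escaped.
        deny=[re.escape(url) for url in urls_visited]))]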
Instead, I have added a custom downloader middleware class to filter out requests for URLs that have already been scraped:
import logging

from scrapy import signals
from scrapy.exceptions import IgnoreRequest


class NewsCrawlerDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    def __init__(self):
        self.urls_visited = get_visited_urls()  # from database

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called

        # Here we check if the url has already been scraped;
        # if not, process the request.
        if request.url in self.urls_visited:
            logging.info('ignoring url %s', request.url)
            raise IgnoreRequest()
        else:
            return request
The middleware order in my settings.py:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'news_crawler.middlewares.NewsCrawlerDownloaderMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
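The custom middleware itself is enabled via DOWNLOADER_MIDDLEWARES, along these lines (the priority 543 is just Scrapy's project-template default, used here as an assumption):

# settings.py
DOWNLOADER_MIDDLEWARES = {
    "news_crawler.middlewares.NewsCrawlerDownloaderMiddleware": 543,
}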
However, when it tries to crawl the first URL, it gives me the following error:
ERROR: Error downloading <GET https://arynews.tv/robots.txt>: maximum recursion depth exceeded while calling a Python object
Any ideas on how to correctly use my custom downloader middleware to filter out already-scraped URLs?
You can create a downloader middleware that performs the request filtering based on your database query; see the documentation. In this case you need to define a class with a process_request(request, spider) method and enable that middleware in your settings (how exactly depends on whether you launch the spider via the CLI or from a Python script).
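A minimal sketch of such a middleware, reusing the get_visited_urls() helper from the question. The key difference from the code above is that process_request returns None for requests that should proceed: returning the request object itself tells Scrapy to re-schedule it, so it goes through the middleware chain again and again, which is most likely what produces the "maximum recursion depth exceeded" error on the robots.txt request.

import logging

from scrapy.exceptions import IgnoreRequest


class VisitedUrlsFilterMiddleware:
    """Drop requests whose URL has already been scraped into the database."""

    def __init__(self):
        # get_visited_urls() is the database helper from the question.
        self.urls_visited = set(get_visited_urls())

    def process_request(self, request, spider):
        if request.url in self.urls_visited:
            logging.info('ignoring already scraped url %s', request.url)
            raise IgnoreRequest()
        # Return None so Scrapy continues processing this request normally.
        return None

Register it in DOWNLOADER_MIDDLEWARES as with the original class; requests that raise IgnoreRequest are dropped before any download happens.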
Alternatively, you can define your own duplicate filter; take a look at dupefilters.py. This can be a bit more involved, though, since it requires some knowledge of and experience with Scrapy.
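If you go that route, a rough sketch (again assuming the get_visited_urls() helper from the question) could subclass RFPDupeFilter and report database hits as already seen, so the scheduler never enqueues them:

from scrapy.dupefilters import RFPDupeFilter


class MongoAwareDupeFilter(RFPDupeFilter):
    """Treat URLs already stored in the database as duplicates."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # get_visited_urls() is the database helper from the question.
        self.urls_visited = set(get_visited_urls())

    def request_seen(self, request):
        # Database hits count as "seen"; otherwise fall back to the
        # normal fingerprint-based duplicate check.
        if request.url in self.urls_visited:
            return True
        return super().request_seen(request)


# settings.py (the module path is hypothetical):
# DUPEFILTER_CLASS = "news_crawler.dupefilters.MongoAwareDupeFilter"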