Scrapy - Crawl urls sequentially after a condition is met
I have a spider that starts from four different start_urls and then goes on to crawl certain links inside them. They all share the same domain and structure; the only thing that differs between them is the query parameter. I use two rules: one to open and parse each link, and one for pagination.

My problem: because of the huge amount of content the pagination produces, I don't want to crawl every link. I need to check a condition (the publication year) on each crawled link, and as soon as that year differs from the one I want, the spider should skip crawling all the remaining links belonging to that start_url and move on to the links generated by the second start_url. How can I do that? This is the code of my spider:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class articleSpider(CrawlSpider):
    name = 'article'
    allowed_domains = ['website.com']
    start_urls = [
        'https://www.website.com/search/?category=value1',
        'https://www.website.com/search/?category=value2',
        'https://www.website.com/search/?category=value3',
        'https://www.website.com/search/?category=value4',
    ]

    rules = (
        Rule(
            LinkExtractor(
                restrict_xpaths="//div[@class='results-post']/article/a"
            ),
            callback='parse_item',
            follow=True,
        ),
        Rule(
            LinkExtractor(
                restrict_xpaths="//section[@class='results-navi'][1]/div/div[@class='prevpageNav left']"
            )
        ),
    )

    def parse_item(self, response):
        name = response.url.strip('/').split('/')[-1]
        date = response.xpath("//section/p/time/@datetime").get()[:4]
        if date == '2020':
            with open(f'./src/data/{name}.html', 'wb') as f:
                f.write(response.text.encode('utf8'))
        return
Thanks in advance for your help.
I don't know of a simple way to achieve this, but maybe the following (untested) code can get you started.
The logic is as follows:
- Override start_requests so that you only start from the first start-url
- Pass the other start-urls along in the meta
- In the parse method, get the item-urls and the next-page url
- Go through the item-urls one by one. As long as you hit 2020, it keeps working through the item_urls (and moves on to the next-page url if you run out of item-urls). As soon as it hits a different year, it moves on to the next start_url.
from scrapy import Spider, Request


class articleSpider(Spider):
    name = 'article'
    allowed_domains = ['website.com']
    start_urls = [
        'https://www.website.com/search/?category=value1',
        'https://www.website.com/search/?category=value2',
        'https://www.website.com/search/?category=value3',
        'https://www.website.com/search/?category=value4',
    ]

    def start_requests(self):
        # reversed copy so .pop() walks the start-urls in their original order
        # without mutating the class attribute
        start_urls = list(reversed(self.start_urls))
        start_url = start_urls.pop()
        meta = {'start_urls': start_urls}
        yield Request(start_url, callback=self.parse, meta=meta)

    def parse(self, response):
        start_urls = response.meta['start_urls']
        # get the item-urls (the href of each result link, made absolute)
        item_urls = [
            response.urljoin(url) for url in response.xpath(
                '//div[@class="results-post"]/article/a/@href'
            ).getall()
        ]
        # get the next page-url (assuming the pagination link is an <a> inside this div)
        next_page = response.xpath(
            '//section[@class="results-navi"][1]/div/div[@class="prevpageNav left"]/a/@href'
        ).get()
        if next_page:
            next_page = response.urljoin(next_page)
        # pass the remaining item-urls and the next page in the meta
        item_url = item_urls.pop()
        meta = {
            'next_page': next_page,
            'item_urls': item_urls,
            'start_urls': start_urls,
        }
        yield Request(item_url, self.parse_item, meta=meta)

    def parse_item(self, response):
        item_urls = response.meta['item_urls']
        next_page = response.meta['next_page']
        start_urls = response.meta['start_urls']
        name = response.url.strip('/').split('/')[-1]
        date = response.xpath("//section/p/time/@datetime").get()[:4]
        if date == '2020':
            with open(f'./src/data/{name}.html', 'wb') as f:
                f.write(response.text.encode('utf8'))
            try:
                item_url = item_urls.pop()
            except IndexError:
                # all items are done - we go to the next page
                if next_page:
                    meta = {'start_urls': start_urls}
                    yield Request(next_page, self.parse, meta=meta)
                else:
                    # no pages left, go to the next start_url
                    try:
                        start_url = start_urls.pop()
                    except IndexError:
                        # nothing left to do
                        return
                    else:
                        meta = {'start_urls': start_urls}
                        yield Request(start_url, self.parse, meta=meta)
            else:
                # still items left to process
                meta = {
                    'next_page': next_page,
                    'item_urls': item_urls,
                    'start_urls': start_urls,  # keep passing the remaining start-urls along
                }
                yield Request(item_url, self.parse_item, meta=meta)
        else:
            # different year - skip the rest and go to the next start_url
            try:
                start_url = start_urls.pop()
            except IndexError:
                # nothing left to do
                return
            else:
                meta = {'start_urls': start_urls}
                yield Request(start_url, self.parse, meta=meta)
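As a side note, on Scrapy 1.7 or newer you can hand the same state from callback to callback with cb_kwargs instead of meta, which keeps your own bookkeeping separate from Scrapy's internal meta keys. Below is only a minimal sketch of that hand-off pattern (the spider name, the shortened url list and the log message are made up for illustration); the year check and page handling would stay exactly as in the code above.

from scrapy import Spider, Request


class CbKwargsSketchSpider(Spider):
    # hypothetical spider, only to illustrate passing state via cb_kwargs
    name = 'cb_kwargs_sketch'
    allowed_domains = ['website.com']
    start_urls = [
        'https://www.website.com/search/?category=value1',
        'https://www.website.com/search/?category=value2',
    ]

    def start_requests(self):
        # reversed copy so .pop() walks the urls in their original order
        remaining = list(reversed(self.start_urls))
        yield Request(remaining.pop(), callback=self.parse,
                      cb_kwargs={'remaining': remaining})

    def parse(self, response, remaining):
        # `remaining` arrives as a normal keyword argument instead of response.meta
        self.logger.info('crawled %s, %d start-urls left', response.url, len(remaining))
        if remaining:
            yield Request(remaining.pop(), callback=self.parse,
                          cb_kwargs={'remaining': remaining})

The chaining itself is identical; the only difference is where the carried-over urls live.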