scrapy - parsing multiple times
I am trying to parse a domain whose content is laid out as follows:
Page 1 - contains links to 10 articles
Page 2 - contains links to 10 articles
Page 3 - contains links to 10 articles
and so on...
My job is to parse all the articles on all the pages.
My idea - parse all the pages, store the links to all the articles in a list, then iterate over the list and parse each link.
So far I have been able to iterate over the pages, parse them and collect the article links. I am stuck on how to start parsing that list.
My code so far...
import scrapy


class DhoniSpider(scrapy.Spider):
    name = "test"
    start_urls = [
        "https://www.news18.com/cricketnext/newstopics/ms-dhoni.html"
    ]
    count = 0

    def __init__(self, *a, **kw):
        super(DhoniSpider, self).__init__(*a, **kw)
        self.headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0'}
        self.seed_urls = []

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, headers=self.headers, callback=self.parse)

    def parse(self, response):
        DhoniSpider.count += 1
        if DhoniSpider.count > 2:
            # there are many pages, this is just to stop parsing after 2 pages
            return
        for ul in response.css('div.t_newswrap'):
            ref_links = ul.css('div.t_videos_box a.t_videosimg::attr(href)').getall()
            self.seed_urls.extend(ref_links)
        next_page = response.css('ul.pagination li a.nxt::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, headers=self.headers, callback=self.parse)

    def iterate_urls(self):
        for link in self.seed_urls:
            link = response.urljoin(link)
            yield scrapy.Request(link, headers=self.headers, callback=self.parse_page)

    def parse_page(self, response):
        print("called")
How do I iterate over my self.seed_urls list and parse those links? Where should I call the iterate_urls function from?
You don't have to collect the links into a list; you can yield a scrapy.Request as soon as you have extracted them. So, in place of self.seed_urls.extend(ref_links), you can modify your function like this (it now takes the response so it can resolve the relative links):
def iterate_urls(self, response, seed_urls):
    for link in seed_urls:
        # resolve relative hrefs against the page that was just parsed
        link = response.urljoin(link)
        yield scrapy.Request(link, headers=self.headers, callback=self.parse_page)
And call it like this:
...
for ul in response.css('div.t_newswrap'):
    ref_links = ul.css('div.t_videos_box a.t_videosimg::attr(href)').getall()
    yield from self.iterate_urls(response, ref_links)
...
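The yield from matters here: iterate_urls is a generator, so yield from hands each Request it produces back to the Scrapy engine one at a time, whereas a plain yield iterate_urls(...) would emit the generator object itself, which Scrapy cannot schedule.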
Usually in a case like this there is no need for a separate function like your iterate_urls at all:
def parse(self, response):
    DhoniSpider.count += 1
    if DhoniSpider.count > 2:
        # there are many pages, this is just to stop parsing after 2 pages
        return
    for ul in response.css('div.t_newswrap'):
        for ref_link in ul.css('div.t_videos_box a.t_videosimg::attr(href)').getall():
            yield scrapy.Request(response.urljoin(ref_link), headers=self.headers, callback=self.parse_page, priority=5)
    next_page = response.css('ul.pagination li a.nxt::attr(href)').get()
    if next_page is not None:
        next_page = response.urljoin(next_page)
        yield scrapy.Request(next_page, headers=self.headers, callback=self.parse)

def parse_page(self, response):
    print("called")