How to crawl multiple pages in one spider?
I'm a beginner with Scrapy and I want to build a crawl job that covers multiple pages in a single spider.
FYI: it's an e-commerce site, and the job should go through the listing page by page to find all products. For each product found, it should open that product's own URL and scrape that specific product's data.
The code should work like this:
- Open a listing page from the list of URLs (page 1)
- Find all products on the page
- Loop over each product -> follow its URL -> scrape its data
- Find the next page
- Follow the next page and repeat
Here is my code:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "acfc_spider"

    # List of URLs
    def start_reqeust(self):
        urls = [
            "https://www.acfc.com.vn/nam/promotion.html?p=2",
            "https://www.acfc.com.vn/nu/promotion.html?p=1",
            "https://www.acfc.com.vn/outlet.html?p=1",
            "https://www.acfc.com.vn/tre-em/khuyen-mai.html?p=1"
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    # Crawl the product detail
    def parse_product_detail(self, response):
        with open('datail_product.txt', 'a') as wr:
            wr.write('Crawled this detail product with URL ' + str(response.request.url) + '\n')

    # Crawl page after page
    def parse(self, response):
        with open('general_product.txt', 'a') as wr:
            wr.write(response.request.url + '\n')
        # Find all products
        list_of_product = response.css("li.item.product.product-item a::attr(href)").getall()
        # Go to the page of a specific product to crawl it
        for i in list_of_product:
            yield scrapy.Request(url=i, callback=self.parse_product_detail)
        # Go to the next page and repeat
        current_page = (response.request.url)[-1:]
        next_page = str(int(current_page) + 1)
        list_of_page = response.css("li.item a.page").xpath("@href").extract()
        next_page_url = [i for i in list_of_page if i[-1] == next_page]
        yield response.follow(next_page_url, self.parse)
For now I just have it write log lines to .txt files. But when I run scrapy crawl acfc_spider, this is what I get:
2021-11-25 16:39:22 [scrapy.core.engine] INFO: Spider opened
2021-11-25 16:39:22 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-11-25 16:39:22 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-11-25 16:39:22 [scrapy.core.engine] INFO: Closing spider (finished)
2021-11-25 16:39:22 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'elapsed_time_seconds': 0.005,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 11, 25, 9, 39, 22, 162166),
'log_count/INFO': 10,
'start_time': datetime.datetime(2021, 11, 25, 9, 39, 22, 157166)}
2021-11-25 16:39:22 [scrapy.core.engine] INFO: Spider closed (finished)
In the end I can't find any of my .txt log files. Something must be wrong, but I don't know what.
Please help!
You have a typo: start_reqeust instead of start_requests.
Second, you are trying to follow a list:
next_page_url = [i for i in list_of_page if i[-1] == next_page]
yield response.follow(next_page_url, self.parse)
Also, you don't need the list of all pages just to get to the next page.
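As a side note on the list issue: response.follow expects a single URL (a string, Link, or Selector), not a list. If you really did want to keep a list of candidate next-page URLs, a minimal sketch could look like this (assuming Scrapy >= 2.0 for response.follow_all; this is an illustration, not part of the fix below):

    # Either pick one element out of the list...
    if next_page_url:
        yield response.follow(next_page_url[0], callback=self.parse)

    # ...or let response.follow_all handle the whole iterable (Scrapy >= 2.0)
    yield from response.follow_all(next_page_url, callback=self.parse)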
Here is the full code (by the way, consider using Scrapy's FEEDS to collect your scraped results; a short FEEDS sketch follows the code below):
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "acfc_spider"

    # List of URLs
    def start_requests(self):
        urls = [
            "https://www.acfc.com.vn/nam/promotion.html?p=1",
            "https://www.acfc.com.vn/nu/promotion.html?p=1",
            "https://www.acfc.com.vn/outlet.html?p=1",
            "https://www.acfc.com.vn/tre-em/khuyen-mai.html?p=1"
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    # Crawl the product detail
    def parse_product_detail(self, response):
        with open('datail_product.txt', 'a') as wr:
            wr.write(f'Crawled this detail product with URL {str(response.request.url)}\n')

    # Crawl page after page
    def parse(self, response):
        with open('general_product.txt', 'a') as wr:
            wr.write(response.request.url + '\n')
        # Find all products
        list_of_product = response.css('a.product-item-link::attr(href)').getall()
        # Go to the page of a specific product to crawl it
        for i in list_of_product:
            yield scrapy.Request(url=i, callback=self.parse_product_detail)
        # Go to the next page and repeat (default callback is self.parse)
        next_page_url = response.css('.next::attr(href)').get()
        if next_page_url:
            yield response.follow(next_page_url)
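And here is the FEEDS idea mentioned above, as a minimal sketch rather than a drop-in replacement: instead of writing .txt files by hand, yield items and let a feed exporter serialize them. FEEDS and custom_settings are real Scrapy settings (FEEDS needs Scrapy >= 2.1), but the output file name and the item fields below are just assumptions for illustration:

    import scrapy


    class ProductsSpider(scrapy.Spider):
        name = "acfc_spider"
        # Hypothetical feed config: write all yielded items to a CSV file
        custom_settings = {
            "FEEDS": {
                "products.csv": {"format": "csv"},
            },
        }

        def parse_product_detail(self, response):
            # Yield a dict item; the feed exporter handles writing it out
            yield {"url": response.url}

With this in place you no longer need the open(...) calls in the callbacks; running scrapy crawl acfc_spider produces products.csv directly.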