How to crawl multiple pages in one spider?

I'm a beginner with Scrapy, and I want to build a crawl job that covers multiple pages in a single spider.

FYI: this is an e-commerce site, and the job should find all products page by page. For each product it finds, it should open that product's own URL and scrape the data for that specific product.

The code should work like this:

  1. Open the listing page for each URL (page 1)
  2. Find all the products
  3. Loop over each product -> follow its URL -> scrape its data
  4. Find the next page
  5. Follow the next page

Here is my code:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "acfc_spider"

    #List of URL
    def start_reqeust(self):
        urls =[
            "https://www.acfc.com.vn/nam/promotion.html?p=2",
            "https://www.acfc.com.vn/nu/promotion.html?p=1",
            "https://www.acfc.com.vn/outlet.html?p=1",
            "https://www.acfc.com.vn/tre-em/khuyen-mai.html?p=1"
        ]

        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)
    
    #Crawl the product detail
    def parse_product_detail(self, response):
        with open('datail_product.txt', 'a') as wr:
            wr.write('Crawled this detail product with URL ' + str(response.request.url) + '\n')

    #Crawl page after page
    def parse(self, response):
        with open('general_product.txt', 'a') as wr:
            wr.write(response.request.url + '\n')

        #Found all products
        list_of_product = response.css("li.item.product.product-item  a::attr(href)").getall()

        #Go to the page of a specific product to do crawl
        for i in list_of_product:
            yield scrapy.Request(url=i, callback=self.parse_product_detail)

        #Go to the next page and repeat
        current_page = (response.request.url)[-1:]
        next_page = str(int(current_page)+1)
        list_of_page = response.css("li.item a.page").xpath("@href").extract()
        next_page_url = [i for i in list_of_page if i[-1] == next_page]
        yield response.follow(next_page_url, self.parse)

For now I'm just having it write its log output to .txt files.

But when I run scrapy crawl acfc_spider, I get this:

2021-11-25 16:39:22 [scrapy.core.engine] INFO: Spider opened
2021-11-25 16:39:22 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-11-25 16:39:22 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-11-25 16:39:22 [scrapy.core.engine] INFO: Closing spider (finished)
2021-11-25 16:39:22 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'elapsed_time_seconds': 0.005,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2021, 11, 25, 9, 39, 22, 162166),
 'log_count/INFO': 10,
 'start_time': datetime.datetime(2021, 11, 25, 9, 39, 22, 157166)}
2021-11-25 16:39:22 [scrapy.core.engine] INFO: Spider closed (finished)

In the end, I can't find any of my .txt log files. Something is clearly wrong, but I don't know why.

Please help!

You have a typo: start_reqeust instead of start_requests. Scrapy only looks for a method named exactly start_requests, and since this spider defines no start_urls either, no requests are ever scheduled, which is why the log shows "Crawled 0 pages" and none of your .txt files get created.
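
As a side note (an alternative, not what your code does): when the initial requests need no custom logic, you can drop start_requests entirely and declare start_urls instead; Scrapy's default start_requests() then builds the requests and sends every response to parse(). A minimal sketch:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "acfc_spider"

    # With start_urls defined, Scrapy's built-in start_requests()
    # creates the initial requests and routes responses to parse().
    start_urls = [
        "https://www.acfc.com.vn/nam/promotion.html?p=1",
        "https://www.acfc.com.vn/nu/promotion.html?p=1",
        "https://www.acfc.com.vn/outlet.html?p=1",
        "https://www.acfc.com.vn/tre-em/khuyen-mai.html?p=1",
    ]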

Second, you are trying to follow a list:

next_page_url = [i for i in list_of_page if i[-1] == next_page]
yield response.follow(next_page_url, self.parse)
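
response.follow expects a single URL (a string, Link, or Selector), not a list, so even with the typo fixed this line would raise an error. If you actually wanted to follow every URL in a list, Scrapy 2.0+ provides response.follow_all; a minimal sketch:

# follow_all accepts an iterable of URLs and yields one Request per URL
yield from response.follow_all(next_page_url, callback=self.parse)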

Also, you don't need all of that to get to the next page: the pagination already has a "next" link you can select directly.

Here is the full code (by the way, consider using Scrapy's FEEDS to collect your scraped results):

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "acfc_spider"

    #List of URL
    def start_requests(self):
        urls = [
            "https://www.acfc.com.vn/nam/promotion.html?p=1",
            "https://www.acfc.com.vn/nu/promotion.html?p=1",
            "https://www.acfc.com.vn/outlet.html?p=1",
            "https://www.acfc.com.vn/tre-em/khuyen-mai.html?p=1"
        ]

        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    #Crawl the product detail
    def parse_product_detail(self, response):
        with open('datail_product.txt', 'a') as wr:
            wr.write(f'Crawled this detail product with URL {str(response.request.url)}\n')

    #Crawl page after page
    def parse(self, response):
        with open('general_product.txt', 'a') as wr:
            wr.write(response.request.url + '\n')

        #Found all products
        list_of_product = response.css('a.product-item-link::attr(href)').getall()

        #Go to the page of a specific product to do crawl
        for i in list_of_product:
            yield scrapy.Request(url=i, callback=self.parse_product_detail)

        #Go to the next page and repeat
        next_page_url = response.css('.next::attr(href)').get()
        if next_page_url:
            yield response.follow(next_page_url)
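
For completeness, here is roughly what the FEEDS suggestion could look like. This is only a sketch: the feed file name and the field yielded below are made-up examples, not part of the original answer. The idea is to yield items from the callbacks and let Scrapy export them, instead of opening .txt files by hand:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "acfc_spider"

    # FEEDS (Scrapy 2.1+) writes every yielded item to the given file;
    # "products.jsonl" and the "url" field are illustrative choices.
    custom_settings = {
        "FEEDS": {
            "products.jsonl": {"format": "jsonlines"},
        },
    }

    def parse_product_detail(self, response):
        # Yield an item instead of appending to a text file.
        yield {"url": response.url}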