Scrapy via API requests: [Product Catalogue page > Product Page] > Pagination

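I am trying to scrape product details from product pages using API requests. I can reach the product catalogue page and get the request URL for each product without any problem. However, I am having trouble getting them parsed correctly from one function to the next.

I think I am missing a few lines of code, or I am using self.parse incorrectly. If I send a new request for each product page, should I also send new request headers? The request headers for a product page differ from those for the product catalogue page. How should I do this?

I would really appreciate your feedback and help. Thank you very much.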

Here is my work so far: https://pastebin.com/H1yyDiDL

import scrapy
from scrapy.exceptions import CloseSpider
import json

class HtmshopeeSpider(scrapy.Spider):
    name = 'shopeeitem2'

    headers={
        'authority': 'shopee.com.my',
        'method': 'GET',
        'path': '/api/v4/search/search_items?by=relevancy&keyword=chantiva&limit=60&newest=0&order=desc&page_type=search&scenario=PAGE_GLOBAL_SEARCH&version=2',
        'scheme': 'https',
        'accept': '*/*',
        'accept-encoding': 'gzip, deflate, br',
        'accept-language': 'en-US,en;q=0.9',
        'cache-control': 'no-cache',
        'cookie': 'private_content_version=75d921dc5d1fc85c97d8d9876d6e58b2; _fbp=fb.2.1626162049790.1893904607; _ga=GA1.3.518387377.1626162051; _gid=GA1.3.151467354.1626162051; _gcl_au=1.1.203553443.1626162051; x_axis_main=v_id:017a9ecfb7ba000a4be21b24a20803079001c0710093c$_sn:1$_ss:1$_pn:1%3Bexp-session$_st:1626163851002$ses_id:1626162051002%3Bexp-session',
        'if-none-match-': '55b03-676eb00af72df9e2b38a2976dd41d5ea',
        'pragma': 'no-cache',
        'referer': 'https://shopee.com.my/search?keyword=chantiva&page=0',
        'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
        'sec-ch-ua-mobile': '?0',
        'sec-fetch-dest': 'empty',
        'sec-fetch-mode': 'cors',
        'sec-fetch-site': 'same-origin',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36',
        'x-api-source': 'pc',
        'x-requested-with': 'XMLHttpRequest',
        'x-shopee-language': 'en'
    }

    def start_requests(self):
        yield scrapy.Request(
            url= 'https://shopee.com.my/api/v4/search/search_items?by=relevancy&keyword=chantiva&limit=60&newest=0&order=desc&page_type=search&scenario=PAGE_GLOBAL_SEARCH&version=2',
            headers=self.headers,
            callback=self.parse_products,
            meta={
                'newest':0
            }
        )

    def parse_products(self, response):
        json_resp = json.loads(response.body)
        products = json_resp.get('items')

        for product in products:
            item_id = product.get('item_basic').get('itemid'),
            shop_id = product.get('item_basic').get('shopid')

            yield scrapy.Request(
                url=f"https://shopee.com.my/api/v2/item/get?itemid={item_id}&shopid={shop_id}",
                callback=self.parse_data,
                headers=self.headers
            )

    def parse_data(self, response):
        json_resp = json.loads(response.body)
        datas = json_resp.get('item')

        for data in datas:
            yield {
                'product': data.get('name')
            }



    count= 240000

    next_page = response.meta['newest'] + 60


    if next_page <= count:
        yield scrapy.Request(
            url=f"https://shopee.com.my/api/v4/search/search_items?by=relevancy&keyword=chantiva&limit=60&newest={next_page}&order=desc&page_type=search&scenario=PAGE_GLOBAL_SEARCH&version=2",
            headers=self.headers,
            meta={'newest': next_page}
        )

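Here is the solution. The search response reports a total count of 123 items (total_count), and each page holds 60 items (limit=60 in the URL).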
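
The page arithmetic implied by those two numbers can be sketched as follows. This is a minimal illustration, not part of the spider; the helper name page_offsets is hypothetical. It lists the values of the newest query parameter needed to cover every item, and it uses a strict bound so that no empty trailing page is requested when the total is an exact multiple of the page size.

```python
def page_offsets(total_count, page_size=60):
    """Return the 'newest' offsets needed to cover total_count items.

    page_offsets is a hypothetical helper used only for illustration.
    """
    # range() stops before total_count, so an exact multiple of page_size
    # does not produce an extra, empty page.
    return list(range(0, total_count, page_size))


if __name__ == "__main__":
    # 123 items at 60 per page need three requests.
    print(page_offsets(123))
```

Three offsets for 123 items is consistent with the three requests reported in the Scrapy stats of the run shown further down.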

Code:

import scrapy
import json

class HtmshopeeSpider(scrapy.Spider):
    name = 'shopeeitem2'

    headers={
        'authority': 'shopee.com.my',
        'method': 'GET',
        'path': '/api/v4/search/search_items?by=relevancy&keyword=chantiva&limit=60&newest=0&order=desc&page_type=search&scenario=PAGE_GLOBAL_SEARCH&version=2',
        'scheme': 'https',
        'accept': '*/*',
        'accept-encoding': 'gzip, deflate, br',
        'accept-language': 'en-US,en;q=0.9',
        'cache-control': 'no-cache',
        'cookie': 'private_content_version=75d921dc5d1fc85c97d8d9876d6e58b2; _fbp=fb.2.1626162049790.1893904607; _ga=GA1.3.518387377.1626162051; _gid=GA1.3.151467354.1626162051; _gcl_au=1.1.203553443.1626162051; x_axis_main=v_id:017a9ecfb7ba000a4be21b24a20803079001c0710093c$_sn:1$_ss:1$_pn:1%3Bexp-session$_st:1626163851002$ses_id:1626162051002%3Bexp-session',
        'if-none-match-': '55b03-676eb00af72df9e2b38a2976dd41d5ea',
        'pragma': 'no-cache',
        'referer': 'https://shopee.com.my/search?keyword=chantiva&page=0',
        'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
        'sec-ch-ua-mobile': '?0',
        'sec-fetch-dest': 'empty',
        'sec-fetch-mode': 'cors',
        'sec-fetch-site': 'same-origin',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36',
        'x-api-source': 'pc',
        'x-requested-with': 'XMLHttpRequest',
        'x-shopee-language': 'en'
    }

    def start_requests(self):
        yield scrapy.Request(
            url= 'https://shopee.com.my/api/v4/search/search_items?by=relevancy&keyword=chantiva&limit=60&newest=0&order=desc&page_type=search&scenario=PAGE_GLOBAL_SEARCH&version=2',
            headers=self.headers,
            callback=self.parse_products,
            meta={
                'newest':0
            }
        )

    def parse_products(self, response):
        json_resp = json.loads(response.body)
        # Fall back to an empty list in case 'items' is missing or null
        products = json_resp.get('items') or []

        for product in products:
            item_basic = product.get('item_basic') or {}
            yield {
                'Name': item_basic.get('name'),
                'Price': item_basic.get('price')
            }

        # total_count is the number of results reported by the API (123 in this run)
        count = json_resp.get('total_count') or 0
        next_page = response.meta['newest'] + 60
        # Request the next page only while its offset is still below the total
        if next_page < count:
            yield scrapy.Request(
                url=f'https://shopee.com.my/api/v4/search/search_items?by=relevancy&keyword=chantiva&limit=60&newest={next_page}&order=desc&page_type=search&scenario=PAGE_GLOBAL_SEARCH&version=2',
                callback=self.parse_products,
                headers=self.headers,
                meta={'newest': next_page}
            )
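
The code above reads name and price straight from the search response, so it never follows each product to its own page. If per-product requests are still wanted, the sketch below shows one possible way to prepare them. It is hedged throughout: detail_urls and clean_headers are hypothetical helper names, the /api/v2/item/get endpoint is taken from the question and is not verified here, and treating 'authority', 'method', 'path' and 'scheme' as HTTP/2 pseudo-headers copied from browser dev tools is an assumption. It also avoids the trailing comma after item_id in the question, which turns the value into a tuple.

```python
from urllib.parse import urlencode

# Assumption: these keys are HTTP/2 pseudo-headers copied from browser
# dev tools and are not regular request headers.
PSEUDO_HEADERS = {"authority", "method", "path", "scheme"}


def clean_headers(headers, overrides=None):
    """Copy headers, drop pseudo-header keys, then apply per-request overrides."""
    cleaned = {k: v for k, v in headers.items() if k.lower() not in PSEUDO_HEADERS}
    cleaned.update(overrides or {})
    return cleaned


def detail_urls(search_json):
    """Yield one product-detail URL per item in a parsed search response."""
    for product in search_json.get("items") or []:
        basic = product.get("item_basic") or {}
        item_id = basic.get("itemid")  # no trailing comma, so this stays a scalar
        shop_id = basic.get("shopid")
        if item_id is None or shop_id is None:
            continue  # skip entries that lack either identifier
        query = urlencode({"itemid": item_id, "shopid": shop_id})
        # Endpoint taken from the question; whether it still responds is unknown.
        yield f"https://shopee.com.my/api/v2/item/get?{query}"


if __name__ == "__main__":
    sample = {"items": [{"item_basic": {"itemid": 1, "shopid": 2}}]}
    for url in detail_urls(sample):
        print(url)
```

Inside parse_products, each URL could then be passed to scrapy.Request together with headers=clean_headers(self.headers, {...}) and callback=self.parse_data, since headers are set per request. In parse_data, json_resp.get('item') would presumably be a single dict rather than a list, so the fields can be read from it directly instead of looping over it.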

Output (a portion of the total output):

{'Name': 'Chantiva Haruan Tablet SS Plus 450mg (60 Tabs) Cepat sembuh luka', 'Price': 9000000}
2021-08-10 12:40:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://shopee.com.my/api/v4/search/search_items?by=relevancy&keyword=chantiva&limit=60&newest=0&order=desc&page_type=search&scenario=PAGE_GLOBAL_SEARCH&version=2>  
{'Name': 'CHANTIVA 750MG 30 TABLETS (EXP:04/23)', 'Price': 8490000}
2021-08-10 12:40:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://shopee.com.my/api/v4/search/search_items?by=relevancy&keyword=chantiva&limit=60&newest=0&order=desc&page_type=search&scenario=PAGE_GLOBAL_SEARCH&version=2>  
{'Name': 'CHANTIVA TABLET HARUAN SS PLUS 450MG (EXP: 03/2022)', 'Price': 1389000}

{'Name': 'CHANTIVA HARUAN SS PLUS TAB 60S', 'Price': 7550000}
2021-08-10 12:40:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://shopee.com.my/api/v4/search/search_items?by=relevancy&keyword=chantiva&limit=60&newest=60&order=desc&page_type=search&scenario=PAGE_GLOBAL_SEARCH&version=2> 
{'Name': "CHANTIVA 450MG 1 STRIP 10'S (IKAN HARUAN)", 'Price': 2000000}
2021-08-10 12:40:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://shopee.com.my/api/v4/search/search_items?by=relevancy&keyword=chantiva&limit=60&newest=60&order=desc&page_type=search&scenario=PAGE_GLOBAL_SEARCH&version=2> 
{'Name': 'CHANTIVA TABLET 750MG (EXP 04/23)', 'Price': 3800000}
2021-08-10 12:40:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://shopee.com.my/api/v4/search/search_items?by=relevancy&keyword=chantiva&limit=60&newest=60&order=desc&page_type=search&scenario=PAGE_GLOBAL_SEARCH&version=2> 
{'Name': "TrueLifeSciences® CHANTIVA Haruan SS Plus 450mg Tablet 60's", 'Price': 8460000}
2021-08-10 12:40:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://shopee.com.my/api/v4/search/search_items?by=relevancy&keyword=chantiva&limit=60&newest=60&order=desc&page_type=search&scenario=PAGE_GLOBAL_SEARCH&version=2> 
{'Name': 'Chantiva 450mg Tablet', 'Price': 9400000}
2021-08-10 12:40:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://shopee.com.my/api/v4/search/search_items?by=relevancy&keyword=chantiva&limit=60&newest=60&order=desc&page_type=search&scenario=PAGE_GLOBAL_SEARCH&version=2> 
{'Name': 'Chantiva Tablet Haruan SS Plus 450mg 60s', 'Price': 8565000}
2021-08-10 12:40:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://shopee.com.my/api/v4/search/search_items?by=relevancy&keyword=chantiva&limit=60&newest=60&order=desc&page_type=search&scenario=PAGE_GLOBAL_SEARCH&version=2> 
{'Name': 'Chantiva Skin Fix Cream 20g x2 (Twin Pack)', 'Price': 5380000}
2021-08-10 12:40:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://shopee.com.my/api/v4/search/search_items?by=relevancy&keyword=chantiva&limit=60&newest=60&order=desc&page_type=search&scenario=PAGE_GLOBAL_SEARCH&version=2> 
{'Name': "CHANTIVA 450MG TABLET 60'S", 'Price': 7690000}
2021-08-10 12:40:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://shopee.com.my/api/v4/search/search_items?by=relevancy&keyword=chantiva&limit=60&newest=60&order=desc&page_type=search&scenario=PAGE_GLOBAL_SEARCH&version=2> 
{'Name': 'CHANTIVA TABLET HARUAN (450MG/750MG)', 'Price': 2000000}
2021-08-10 12:40:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://shopee.com.my/api/v4/search/search_items?by=relevancy&keyword=chantiva&limit=60&newest=60&order=desc&page_type=search&scenario=PAGE_GLOBAL_SEARCH&version=2> 
{'Name': 'CHANTIVA 750MG 30 TABLETS (EXP: 09/2022)', 'Price': 8490000}
2021-08-10 12:40:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://shopee.com.my/api/v4/search/search_items?by=relevancy&keyword=chantiva&limit=60&newest=120&order=desc&page_type=search&scenario=PAGE_GLOBAL_SEARCH&version=2> (referer: https://shopee.com.my/search?keyword=chantiva&page=0)
2021-08-10 12:40:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://shopee.com.my/api/v4/search/search_items?by=relevancy&keyword=chantiva&limit=60&newest=120&order=desc&page_type=search&scenario=PAGE_GLOBAL_SEARCH&version=2>
{'Name': 'CHANTIVA TABLET IKAN HARUAN 450MG SAKIT LUTUT SAKIT URAT LUKA 60"S', 'Price': 7490000}
2021-08-10 12:40:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://shopee.com.my/api/v4/search/search_items?by=relevancy&keyword=chantiva&limit=60&newest=120&order=desc&page_type=search&scenario=PAGE_GLOBAL_SEARCH&version=2>
{'Name': "[CLEARANCE][WITH FREE GIFT] CHANTIVA TABLET HARUAN SS PLUS 60'S (EXP:02/2021)", 'Price': 7600000}
2021-08-10 12:40:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://shopee.com.my/api/v4/search/search_items?by=relevancy&keyword=chantiva&limit=60&newest=120&order=desc&page_type=search&scenario=PAGE_GLOBAL_SEARCH&version=2>
{'Name': "CHANTIVA 450MG TABLET 6X10'S by strip Exp:10/21", 'Price': 990000}
2021-08-10 12:40:32 [scrapy.core.engine] INFO: Closing spider (finished)
2021-08-10 12:40:32 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 3242,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'downloader/response_bytes': 40725,
 'downloader/response_count': 3,
 'downloader/response_status_count/200': 3,
 'elapsed_time_seconds': 4.219452,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2021, 8, 10, 6, 40, 32, 976939),
 'httpcompression/response_bytes': 377162,
 'httpcompression/response_count': 3,
 'item_scraped_count': 123,