Scrapy: get data through JSON tags

Question

# -*- coding: utf-8 -*-
import scrapy
from ..items import HomedepotItem
import re
import pandas as pd
import requests
import json
from bs4 import BeautifulSoup



class HomedepotSpider(scrapy.Spider):
    name = 'homeDepot'

    start_urls = ['https://www.homedepot.com/p/ZLINE-Kitchen-and-Bath-36-DuraSnow-Stainless-Steel-Range-Hood-with-Hand-Hammered-Copper-Shell-8654HH-36-8654HH-36/311287560']

    def parse(self, response):
        for item in self.parseHomeDepot(response):
            yield item
        pass

    def parseHomeDepot(self, response):
        item = HomedepotItem() #items from items.py

        jsonresponse = json.loads(response.text)
        productPrice = jsonresponse(["offers"][0]["price"])

        #item['productPrice'] = productPrice #display price and assign to variable

        yield item

I'm trying to parse data from this web page's JSON. I previously had a question about JSON answered, and ["offers"]["price"] was the suggested way to go, since the page's JSON is:

"offers":{"@type":"Offer","url":"https://www.homedepot.com/p/ZLINE-Kitchen-and-Bath-36-DuraSnow-Stainless-Steel-Range-Hood-with-Hand-Hammered-Copper-Shell-8654HH-36-8654HH-36/311287560","priceCurrency":"USD","price":1449.95,"priceValidUntil":"4/7/2021","availability":"https://schema.org/InStock"}

But now I get this error: raise JSONDecodeError("Expecting value", s, err.value) from None

Any help would be appreciated!

Answer 1

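You get this error because you cannot pass the plain response.text to json.loads: the body of this response is an HTML page, not a JSON document.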
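A minimal sketch of the failure, assuming a made-up HTML snippet (`html_body` is hypothetical and only stands in for `response.text`). `json.loads` gives up at the first character, `<`, which produces the "Expecting value" message from the question:

```python
import json

# Hypothetical stand-in for response.text: an HTML page, not a JSON document.
html_body = (
    '<html><head><script type="application/ld+json">'
    '{"offers": {"price": 1449.95}}'
    '</script></head></html>'
)

error_message = None
error_position = None
try:
    json.loads(html_body)
except json.JSONDecodeError as exc:
    # The decoder fails before it ever reaches the embedded JSON.
    error_message = exc.msg
    error_position = exc.pos

print(error_message, error_position)
```

The JSON is present in the page, but it is wrapped in markup, so it has to be extracted before it can be decoded.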

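Simply get the JSON from the <script> tag.

The JSON you want is in the first script tag of type application/ld+json.

You have to target that specific tag and then parse its text with json.loads.

For example: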

# -*- coding: utf-8 -*-
import json

import scrapy


class HomedepotSpider(scrapy.Spider):
    name = 'homeDepot'
    start_urls = ['https://www.homedepot.com/p/ZLINE-Kitchen-and-Bath-36-DuraSnow-Stainless-Steel-Range-Hood-with-Hand-Hammered-Copper-Shell-8654HH-36-8654HH-36/311287560']

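    def parse(self, response):
        # Grab the text of the first <script type="application/ld+json"> tag.
        script_tag = response.xpath('//script[@type="application/ld+json"][1]/text()').get()
        # Decode the embedded JSON and yield the resulting dict as the item.
        yield json.loads(script_tag)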
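If you want to try the same idea without Scrapy or a network request, here is a rough sketch that uses the standard library's `html.parser` in place of Scrapy's XPath selector. The `JsonLdExtractor` class and the `sample_html` page are made up for illustration and are not part of the spider above:

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self.blocks.append("")

    def handle_data(self, data):
        # Script content may arrive in several chunks, so accumulate it.
        if self._in_json_ld:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_json_ld = False


# Hypothetical page, trimmed down to the parts that matter here.
sample_html = """
<html><head>
<script type="text/javascript">var x = 1;</script>
<script type="application/ld+json">{"@type": "Product", "offers": {"@type": "Offer", "priceCurrency": "USD", "price": 1449.95}}</script>
</head><body></body></html>
"""

extractor = JsonLdExtractor()
extractor.feed(sample_html)
product = json.loads(extractor.blocks[0])
print(product["offers"]["price"])
```

Inside a spider the XPath version is shorter; this variant is only meant for checking the extraction logic offline.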

Here's an example from the scrapy shell:

scrapy shell 'https://www.homedepot.com/p/ZLINE-Kitchen-and-Bath-36-DuraSnow-Stainless-Steel-Range-Hood-with-Hand-Hammered-Copper-Shell-8654HH-36-8654HH-36/311287560'
...

[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7f2d56604160>
[s]   item       {}
[s]   request    <GET https://www.homedepot.com/p/ZLINE-Kitchen-and-Bath-36-DuraSnow-Stainless-Steel-Range-Hood-with-Hand-Hammered-Copper-Shell-8654HH-36-8654HH-36/311287560>
[s]   response   <200 https://www.homedepot.com/p/ZLINE-Kitchen-and-Bath-36-DuraSnow-Stainless-Steel-Range-Hood-with-Hand-Hammered-Copper-Shell-8654HH-36-8654HH-36/311287560>
[s]   settings   <scrapy.settings.Settings object at 0x7f2d56680ac0>
[s]   spider     <DefaultSpider 'default' at 0x7f2d56105850>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects 
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
>>> script_tag = response.xpath('//script[@type="application/ld+json"][1]/text()').get()
>>> import json
>>> json.loads(script_tag)["offers"]
{'@type': 'Offer', 'url': 'https://www.homedepot.com/p/ZLINE-Kitchen-and-Bath-36-DuraSnow-Stainless-Steel-Range-Hood-with-Hand-Hammered-Copper-Shell-8654HH-36-8654HH-36/311287560', 'priceCurrency': 'USD', 'price': 1449.95, 'priceValidUntil': '4/12/2021', 'availability': 'https://schema.org/InStock'}
>>> json.loads(script_tag)["offers"]["price"]
1449.95
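One caveat: `.get()` returns None when no tag matches, and `json.loads(None)` raises a TypeError. A defensive sketch, assuming a hypothetical helper named `extract_price` (not part of Scrapy) and allowing for the fact that JSON-LD permits "offers" to be either one object or a list of objects:

```python
import json


def extract_price(script_text):
    """Return the offer price from a JSON-LD string, or None if it is absent."""
    if not script_text:
        # No matching <script> tag was found on the page.
        return None
    data = json.loads(script_text)
    if not isinstance(data, dict):
        return None
    offers = data.get("offers")
    # "offers" may be a single object or a list of objects.
    if isinstance(offers, list):
        offers = offers[0] if offers else None
    if not isinstance(offers, dict):
        return None
    return offers.get("price")


single = extract_price('{"offers": {"@type": "Offer", "price": 1449.95}}')
listed = extract_price('{"offers": [{"@type": "Offer", "price": 10.5}]}')
missing = extract_price(None)
print(single, listed, missing)
```

In the spider, this could be called as `extract_price(script_tag)` before assigning the value to the item.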

python

beautifulsoup

scrapy

web-scraping