Scrapy: get data through JSON tags
# -*- coding: utf-8 -*-
import scrapy
from ..items import HomedepotItem
import re
import pandas as pd
import requests
import json
from bs4 import BeautifulSoup


class HomedepotSpider(scrapy.Spider):
    name = 'homeDepot'
    start_urls = ['https://www.homedepot.com/p/ZLINE-Kitchen-and-Bath-36-DuraSnow-Stainless-Steel-Range-Hood-with-Hand-Hammered-Copper-Shell-8654HH-36-8654HH-36/311287560']

    def parse(self, response):
        for item in self.parseHomeDepot(response):
            yield item
        pass

    def parseHomeDepot(self, response):
        item = HomedepotItem()  # items from items.py
        jsonresponse = json.loads(response.text)
        productPrice = jsonresponse(["offers"][0]["price"])
        #item['productPrice'] = productPrice  # display price and assign to variable
        yield item
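I am trying to parse data from the JSON on this web page. I previously had a question about JSON answered, and ["offers"]["prices"] was the suggested approach, because the page's JSON is: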
"offers":{"@type":"Offer","url":"https://www.homedepot.com/p/ZLINE-Kitchen-and-Bath-36-DuraSnow-Stainless-Steel-Range-Hood-with-Hand-Hammered-Copper-Shell-8654HH-36-8654HH-36/311287560","priceCurrency":"USD","price":1449.95,"priceValidUntil":"4/7/2021","availability":"https://schema.org/InStock"}
So now I get the error: raise JSONDecodeError("Expecting value", s, err.value) from None
Any help would be appreciated!
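You are getting this error because you cannot use the plain response.text to get the JSON: response.text is the whole HTML page, and the JSON sits inside a <script> tag within it.
The JSON you want is in the first script tag whose type is application/ld+json.
You have to target that specific tag and then parse its text with json.loads.
For example: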
# -*- coding: utf-8 -*-
import json
import scrapy


class HomedepotSpider(scrapy.Spider):
    name = 'homeDepot'
    start_urls = ['https://www.homedepot.com/p/ZLINE-Kitchen-and-Bath-36-DuraSnow-Stainless-Steel-Range-Hood-with-Hand-Hammered-Copper-Shell-8654HH-36-8654HH-36/311287560']

    def parse(self, response):
        script_tag = response.xpath('//script[@type="application/ld+json"][1]/text()').get()
        yield json.loads(script_tag)
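To see why json.loads fails on the whole page but works on the script tag's text, here is a minimal offline sketch that uses only the standard library. SAMPLE_HTML is a made-up, trimmed-down stand-in for the real page, and LdJsonCollector, whole_page_is_json and first_ld_json are hypothetical helper names, not part of Scrapy. In a real spider the XPath selector does this job.

```python
import json
from html.parser import HTMLParser

# Hypothetical, heavily trimmed stand-in for the product page HTML.
SAMPLE_HTML = """<html><head>
<script type="application/ld+json">{"offers": {"@type": "Offer", "priceCurrency": "USD", "price": 1449.95}}</script>
</head><body><h1>Range Hood</h1></body></html>"""


class LdJsonCollector(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._inside = True
            self.blocks.append("")

    def handle_data(self, data):
        # Script content may arrive in several chunks, so append to the current block.
        if self._inside:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False


def whole_page_is_json(html):
    """Return True only if the complete document parses as JSON."""
    try:
        json.loads(html)
    except json.JSONDecodeError:
        # HTML starts with "<", which is where "Expecting value" comes from.
        return False
    return True


def first_ld_json(html):
    """Parse the first ld+json script block, or return None if there is none."""
    collector = LdJsonCollector()
    collector.feed(html)
    collector.close()
    return json.loads(collector.blocks[0]) if collector.blocks else None
```

Calling whole_page_is_json(SAMPLE_HTML) reproduces the failure from the question in miniature, while first_ld_json(SAMPLE_HTML)["offers"]["price"] reads the price from the embedded JSON.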
Here is an example from the scrapy shell:
scrapy shell 'https://www.homedepot.com/p/ZLINE-Kitchen-and-Bath-36-DuraSnow-Stainless-Steel-Range-Hood-with-Hand-Hammered-Copper-Shell-8654HH-36-8654HH-36/311287560'
...
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x7f2d56604160>
[s] item {}
[s] request <GET https://www.homedepot.com/p/ZLINE-Kitchen-and-Bath-36-DuraSnow-Stainless-Steel-Range-Hood-with-Hand-Hammered-Copper-Shell-8654HH-36-8654HH-36/311287560>
[s] response <200 https://www.homedepot.com/p/ZLINE-Kitchen-and-Bath-36-DuraSnow-Stainless-Steel-Range-Hood-with-Hand-Hammered-Copper-Shell-8654HH-36-8654HH-36/311287560>
[s] settings <scrapy.settings.Settings object at 0x7f2d56680ac0>
[s] spider <DefaultSpider 'default' at 0x7f2d56105850>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
>>> script_tag = response.xpath('//script[@type="application/ld+json"][1]/text()').get()
>>> import json
>>> json.loads(script_tag)["offers"]
{'@type': 'Offer', 'url': 'https://www.homedepot.com/p/ZLINE-Kitchen-and-Bath-36-DuraSnow-Stainless-Steel-Range-Hood-with-Hand-Hammered-Copper-Shell-8654HH-36-8654HH-36/311287560', 'priceCurrency': 'USD', 'price': 1449.95, 'priceValidUntil': '4/12/2021', 'availability': 'https://schema.org/InStock'}
>>> json.loads(script_tag)["offers"]["price"]
1449.95
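If the page layout changes, the script tag may be missing (so .get() returns None) or "offers" may be a list rather than a single object. The helper below is a rough sketch of a more defensive lookup. extract_price is a hypothetical name, and the list case is an assumption, since this page returned "offers" as a single object.

```python
import json


def extract_price(ld_json_text):
    """Return the offer price from a JSON-LD string, or None if it is unavailable."""
    if not ld_json_text:
        # The selector found no matching script tag.
        return None
    try:
        data = json.loads(ld_json_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    offers = data.get("offers")
    # Assumption: some pages may list several offers; take the first one.
    if isinstance(offers, list):
        offers = offers[0] if offers else None
    if not isinstance(offers, dict):
        return None
    return offers.get("price")
```

Inside the spider this could be used as item['productPrice'] = extract_price(script_tag), assuming HomedepotItem declares a productPrice field as the commented-out line in the question suggests.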