Scrapy - Request Payload format and types for AJAX based websites
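I am trying to scrape noon.com. This is the product I am interested in: https://www.noon.com/uae-en/face-and-beard-wash-multicolour-80ml/N22130693A/p?o=f7adb85c3296590b.
I can get all of the product information except the ratings/review data. The problem is that the site loads the ratings through the API endpoint https://www.noon.com/_svc/reviews/fetch/v1/product-reviews/list, which is called with a POST request.
I tried sending the headers and what I believe is the right payload in a Scrapy request, but I get 400 or 405 with the message "HTTP status code is not handled or not allowed" as the response.
This is how I tried to fetch the ratings data: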
def start_requests(self):
    headers = {
        "authority": "www.noon.com",
        "method": "POST",
        "path": "/_svc/reviews/fetch/v1/product-reviews/list",
        "scheme": "https",
        "accept": "application/json, text/plain, */*",
        "accept-encoding": "gzip, deflate, br",
        "accept-language": "en-US,en;q=0.9",
        "cache-control": "no-cache, max-age=0, must-revalidate, no-store",
        "content-type": "application/json",
        "origin": "https://www.noon.com",
        "referer": "https://www.noon.com/uae-en/face-and-beard-wash-multicolour-80ml/N22130693A/p?o=f7adb85c3296590b",
        "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36',
    }
    url = "https://www.noon.com/_svc/reviews/fetch/v1/product-reviews/list"
    payload = [{"catalogCode": "noon", "sku": "N22130693A", "lang": None, "ratings": [1, 2, 3, 4, 5], "provideBreakdown": True, "page": 1}]
    yield scrapy.Request(url, method="POST", body=json.dumps(payload), headers=headers, callback=self.parse)

def parse(self, response):
    data = json.loads(response.body)
    print(data)
Is there a solution to this problem?
Any help would be greatly appreciated.
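I tried it and it works for me. If it does not work for you, your IP may have been blocked, in which case you may need to go through a proxy API. See whether this works for you: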
def start_requests(self):
    return [scrapy.Request(
        url='https://www.noon.com/_svc/reviews/fetch/v1/product-reviews/list',
        method='POST',
        body='{"catalogCode":"noon","sku":"N22130693A","lang":null,"ratings":[1,2,3,4,5],"provideBreakdown":true,"page":1}',
        headers={
            'content-type': 'application/json'
        }
    )]

def parse(self, response):
    print(response.body)
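One visible difference from the request in the question is that the body here is a single JSON object, whereas the question wraps the same dict in a list before calling `json.dumps`. The sketch below, which assumes the endpoint expects an object (the thread does not confirm this is the cause of the 400/405), shows how to build the same body from a Python dict instead of a hand-written string. The helper name `build_body` is made up for illustration.

```python
import json

# Payload as a plain dict; None and True serialize to null and true.
payload = {
    "catalogCode": "noon",
    "sku": "N22130693A",
    "lang": None,
    "ratings": [1, 2, 3, 4, 5],
    "provideBreakdown": True,
    "page": 1,
}


def build_body(data):
    """Serialize a payload compactly (hypothetical helper, not part of Scrapy)."""
    return json.dumps(data, separators=(",", ":"))


# A JSON object, identical to the hand-written body string.
body = build_body(payload)

# A JSON array containing that object, which is what the question sends.
wrapped_body = build_body([payload])

print(body)
print(wrapped_body)
```

The resulting `body` can be passed as `body=body` to `scrapy.Request`. Recent Scrapy versions also provide `scrapy.http.JsonRequest`, which accepts a `data=` argument and does the serialization for you.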
My output from the spider:
2020-12-23 13:12:35 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.noon.com/_svc/reviews/fetch/v1/product-reviews/list> (referer: None)
b'{"list":[],"summary":{"rating":5.0,"count":1,"commentCount":0},"breakdown":[{"rating":5.0,"count":1,"commentCount":0}],"languages":[],"pagination":{"totalPages":1,"page":1,"perPage":10}}'
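To go from the raw bytes to usable values, the response can be decoded with `json.loads`. This is a minimal sketch that runs against the sample body above rather than a live response; the function name `summarize` and the shape of the returned dict are illustrative choices, and the key names are taken only from that one sample, so other products may return more fields.

```python
import json

# Sample body copied from the spider output above.
sample = (
    b'{"list":[],"summary":{"rating":5.0,"count":1,"commentCount":0},'
    b'"breakdown":[{"rating":5.0,"count":1,"commentCount":0}],'
    b'"languages":[],"pagination":{"totalPages":1,"page":1,"perPage":10}}'
)


def summarize(body):
    """Extract the rating summary and paging state (hypothetical helper)."""
    data = json.loads(body)  # json.loads accepts bytes as well as str
    summary = data.get("summary", {})
    pagination = data.get("pagination", {})
    return {
        "rating": summary.get("rating"),
        "count": summary.get("count"),
        "reviews": data.get("list", []),
        # True when further pages of reviews remain to be requested.
        "has_next_page": pagination.get("page", 1) < pagination.get("totalPages", 1),
    }


result = summarize(sample)
print(result)
# {'rating': 5.0, 'count': 1, 'reviews': [], 'has_next_page': False}
```

Inside a spider, the same function could be called as `summarize(response.body)` in `parse`, and `has_next_page` could decide whether to yield another request with `"page"` incremented.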