Scrapy FormRequest returning 400 error while Python Requests works
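Sending a POST request through Scrapy FormRequest results in a 400 error, while sending the same request through Python Requests succeeds. The request headers and params can't be the problem, since they work with Requests. What in Scrapy could break this?
The code below was run in the scrapy shell: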
url = 'https://www.tripadvisor.co.uk/ShowUserReviews-g2151208-d19219570-r792748373-Tumanyan_Khinkali_at_Tsaghkadzor-Tsakhkadzor_Kotayk_Province.html'
headers = {
    'authority': 'www.tripadvisor.co.uk',
    'method': 'POST',
    'scheme': 'https',
    'accept': 'text/html, */*',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'no-cache',
    'content-length': '102',
    'content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'dnt': '1',
    'origin': 'https://www.tripadvisor.co.uk',
    'pragma': 'no-cache',
    'sec-ch-ua-mobile': '?0',
    'sec-fetch-dest': 'empty',
    'sec-fetch-mode': 'cors',
    'sec-fetch-site': 'same-origin',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36',
    'x-requested-with': 'XMLHttpRequest',
}
params = {
    'returnTo': '#REVIEWS',
    'filterLang': 'ALL',
    'changeSet': 'REVIEW_LIST'
}
Scrapy FormRequest returns a 400 error.
In [10]: req = scrapy.http.FormRequest(
...: url,
...: method='POST',
...: formdata=params,
...: headers=headers)
In [11]: fetch(req)
2021-06-26 21:28:18 [scrapy.core.engine] DEBUG: Crawled (400) <POST https://www.tripadvisor.co.uk/ShowUserReviews-g2151208-d19219570-r792748373-Tumanyan_Khinkali_at_Tsaghkadzor-Tsakhkadzor_Kotayk_Province.html> (referer: None)
Python Requests returns 200, and I can access the content.
In [17]: r = requests.post(url=url, headers=headers, json=params)
2021-06-26 21:30:02 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): www.tripadvisor.co.uk:443
2021-06-26 21:30:04 [urllib3.connectionpool] DEBUG: https://www.tripadvisor.co.uk:443 "POST /ShowUserReviews-g2151208-d19219570-r792748373-Tumanyan_Khinkali_at_Tsaghkadzor-Tsakhkadzor_Kotayk_Province.html HTTP/1.1" 200 16360
In [18]: r.status_code
Out[18]: 200
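One thing worth checking, offered as a hypothesis rather than a confirmed diagnosis: the two calls above do not put the same bytes on the wire. `FormRequest(formdata=...)` URL-encodes the params, while `requests.post(json=...)` serializes them as JSON, and the headers carry a hard-coded 'content-length' of 102. The sketch below uses only the standard library, with `urlencode` and `json.dumps` as stand-ins for what each client builds, to compare the body lengths against that declared value.

```python
import json
from urllib.parse import urlencode

# Same params as in the question.
params = {
    'returnTo': '#REVIEWS',
    'filterLang': 'ALL',
    'changeSet': 'REVIEW_LIST',
}

# Value of the hard-coded 'content-length' header in the question.
declared_length = 102

# Stand-in for the body FormRequest(formdata=params) would send.
form_body = urlencode(params)

# Stand-in for the body requests.post(json=params) would send.
json_body = json.dumps(params)

for label, body in (('form', form_body), ('json', json_body)):
    actual = len(body.encode('utf-8'))
    print(f'{label}: {body!r} is {actual} bytes, header declares {declared_length}')
```

If the lengths disagree with the declared value, dropping 'content-length' (and the 'authority', 'method' and 'scheme' entries, which are HTTP/2 pseudo-headers copied from the browser rather than real header fields) and letting the client compute them is a reasonable first experiment.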
Since I can't access the URL from here, could you try whether the following code works or not? You also have to add a user agent.
import scrapy


class ReviewsSpider(scrapy.Spider):
    name = 'reviews'
    # URL-encoded form body; the puid value is also sent in the x-puid header
    body = "reqNum=1&isLastPoll=false&paramSeqId=0&waitTime=41&changeSet=REVIEW_LIST&puid=YNgN2QokGScAA0-MH9MAAAIQ"

    def start_requests(self):
        # Plain Request with a hand-built body instead of FormRequest
        yield scrapy.Request(
            url='https://www.tripadvisor.co.uk/ShowUserReviews-g2151208-d19219570-r791416821-Tumanyan_Khinkali_at_Tsaghkadzor-Tsakhkadzor_Kotayk_Province.html',
            method="POST",
            body=self.body,
            callback=self.parse,
            # Add a 'user-agent' entry to these headers as well
            headers={
                'content-type': 'application/x-www-form-urlencoded',
                'x-puid': 'YNgN2QokGScAA0-MH9MAAAIQ',
                'x-requested-with': 'XMLHttpRequest'
            },
        )

    def parse(self, response):
        pass
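Rather than typing the body string by hand, it can be built from a dict, which avoids mistakes in the `&` separators. This is a minimal sketch using only the standard library; it assumes the field names are `reqNum`, `isLastPoll`, `paramSeqId`, `waitTime`, `changeSet` and `puid`, with the values taken from the spider above.

```python
from urllib.parse import urlencode

# Field names and values taken from the body string in the spider above.
form_fields = {
    'reqNum': '1',
    'isLastPoll': 'false',
    'paramSeqId': '0',
    'waitTime': '41',
    'changeSet': 'REVIEW_LIST',
    'puid': 'YNgN2QokGScAA0-MH9MAAAIQ',
}

# urlencode joins the pairs with '&' and percent-escapes values where needed.
body = urlencode(form_fields)

print(body)
print(len(body))
```

The resulting string is 102 characters long, which matches the 'content-length' value in the question's headers. That suggests, though it does not prove, that the header was copied from a browser request carrying this body rather than the three params sent in the question.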