XHR scraping in Python: using Scrapy but no data is returned
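OK, I asked about this here before: "python scraping for javascript not working and specific data".
It seems I can get the data by extracting the XHR content, which would give me an alternative way to do this kind of scraping without using Selenium.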
import scrapy
import json


class PublicMutual(scrapy.Spider):
    name = 'publicmutual'
    headers = {'Accept': '*/*',
               'Accept-Encoding': 'gzip, deflate, br',
               'Accept-Language': 'en-US,en;q=0.9',
               'Connection': 'keep-alive',
               'Cookie': '.ASPXANONYMOUS=u8UpT1xTjt54Tf80JCsS2GqJWf4sPIksbzi5JOaw8TsM7i64n54q8yESMrdk81uj2hjiaMMLSMJAl0LcevRrYNP0XoGlGcGMpgNnmpG6YSMM1jAK0; Analytics_VisitorId=42ce4acb-6501-4828-aa81-74ef126af235; Analytics=SessionId=fc660efe-9e82-4379-afd5-8f77f203ff10&TabId=106&ContentItemId=-1; dnn_IsMobile=False; language=en-US; ASP.NET_SessionId=da1cbmzdgrzitjwntnlu3ioq; __RequestVerificationToken=Ry8wSKybT77XgBmmxuOfGmM4a6_Wy-B1MNKrN5g2zfVB1c6GXlL68ZYWUwZKBvVjyheTWQ2',
               'Host': 'www.publicmutual.com.my',
               'Referer': 'https://www.publicmutual.com.my/Our-Products/UT-Fund-Prices',
               'Sec-Fetch-Dest': 'empty',
               'Sec-Fetch-Mode': 'cors',
               'Sec-Fetch-Site': 'same-origin',
               'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36',
               'X-Requested-With': 'XMLHttpRequest'}

    def start_requests(self):
        yield scrapy.Request(url='https://www.publicmutual.com.my/Our-Products/UT-Fund-Prices',
                             headers=self.headers,
                             callback=self.parse)

    def parse(self, response):
        print(json.loads(response.body))
This is what I used as a base. After running this code I get no output at all. I'm not sure what I'm doing wrong here. Please help.
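One thing that may be worth checking in a case like this: the spider requests the page URL itself, and if that URL returns an HTML page rather than JSON, json.loads will raise instead of printing anything. Below is a minimal sketch of a guard for that situation. The helper name parse_json_or_none is hypothetical and is not part of Scrapy or of the original code.

```python
import json


def parse_json_or_none(body):
    """Return the decoded JSON document, or None if the body is not JSON.

    Hypothetical helper: `body` can be the bytes of a response body
    (for example response.body in a Scrapy callback) or a str.
    """
    try:
        return json.loads(body)
    except ValueError:
        # json.JSONDecodeError and UnicodeDecodeError are both
        # subclasses of ValueError, so an HTML page ends up here.
        return None


print(parse_json_or_none(b'{"price": 1}'))
print(parse_json_or_none(b"<html><body>not json</body></html>"))
```

Inside parse, a None result would suggest that the request did not hit the endpoint that actually serves the data.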
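OK, I think I got it working. There is a form in the page that is sent automatically via XHR (...). So we just grab its inputs to forge a payload, and that should do it: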
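from bs4 import BeautifulSoup
import requests

url = 'https://www.publicmutual.com.my/Our-Products/UT-Fund-Prices'

# Fetch the page and parse it so the hidden form fields can be read.
data = requests.get(url).text
soup = BeautifulSoup(data, 'lxml')
inputs = soup.select('input[type="hidden"]')

# Copy every hidden input that has both a name and a value into the payload.
payload = {}
for input_ in inputs:
    if 'name' in input_.attrs and 'value' in input_.attrs:
        payload[input_['name']] = input_['value']

# Post the payload back to the same URL; the response text contains the table.
table = requests.post(url, data=payload).text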
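The payload-building step can also be tried offline, without hitting the site. The sketch below swaps BeautifulSoup for the standard library's html.parser so that it runs with no extra packages. The class HiddenInputCollector, the function build_payload, and the sample HTML with its field names are all made up for illustration and are not taken from the real page.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs from <input type="hidden"> tags."""

    def __init__(self):
        super().__init__()
        self.payload = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        if (attributes.get("type") or "").lower() != "hidden":
            return
        name = attributes.get("name")
        value = attributes.get("value")
        # Skip inputs that lack a name or a value.
        if name is not None and value is not None:
            self.payload[name] = value


def build_payload(html):
    """Return a dict of the hidden form fields found in `html`."""
    collector = HiddenInputCollector()
    collector.feed(html)
    collector.close()
    return collector.payload


# Invented sample markup; the real page's fields will differ.
SAMPLE_HTML = """
<form>
  <input type="hidden" name="example_token" value="abc123">
  <input type="hidden" name="no_value">
  <input type="text" name="visible" value="ignored">
</form>
"""

print(build_payload(SAMPLE_HTML))  # {'example_token': 'abc123'}
```

The resulting dict has the same shape as payload above, so it could be passed as the data argument of requests.post in the same way.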