Scrapy: info only visible in text after expanding element
I have just started trying to scrape brokers' names from http://brokercheck.finra.org. The HTML content I want only becomes visible after expanding some nodes.
import requests
import scrapy

# The URL must be a string and include the scheme.
url = 'https://brokercheck.finra.org/individual/summary/2713535'
r = requests.get(url)
r.text
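I want to get the broker's name, but it sits inside the tag //div[@class="namesummary"], and r.text does not reach that deep. Is this a feature of the website, or is there something I can do so that r.text picks up all the expanded nodes first?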
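When you render the page in a browser, another request is sent to https://api.brokercheck.finra.org/search/individual/2713535?hl=true&nrows=12&query=&start=0&wt=json.
You can find the name in the response to that request; parse the content as JSON. Scrapy cannot handle dynamic requests sent this way on its own, because it does not run the page's JavaScript.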
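As a rough illustration of "parse the content as JSON", here is a minimal sketch that pulls the name fields out of such a response. It assumes the payload has the shape used in the spider further down (`hits.hits[0]._source.content` holding a JSON-encoded string with a `basicInformation` object); the API may have changed since, and it may need extra headers. The helper names `extract_name` and `fetch_name` are made up for this example.

```python
import json
from typing import Optional
from urllib.request import urlopen

# URL pattern taken from the request observed in the browser.
API_URL = (
    "https://api.brokercheck.finra.org/search/individual/{broker_id}"
    "?hl=true&nrows=12&query=&start=0&wt=json"
)


def extract_name(payload: dict) -> Optional[dict]:
    """Return the name fields from a decoded API payload, or None if no hit.

    Assumes the layout hits -> hits[0] -> _source -> content, where
    'content' is itself a JSON-encoded string.
    """
    hits = payload.get("hits", {}).get("hits", [])
    if not hits:
        return None
    content = json.loads(hits[0]["_source"]["content"])
    info = content["basicInformation"]
    return {
        "first_name": info.get("firstName"),
        "middle_name": info.get("middleName"),
        "last_name": info.get("lastName"),
    }


def fetch_name(broker_id: str) -> Optional[dict]:
    """Hypothetical helper: call the API directly and extract the name."""
    with urlopen(API_URL.format(broker_id=broker_id), timeout=10) as resp:
        return extract_name(json.load(resp))
```

Keeping the parsing separate from the network call means it can be checked against a saved response without hitting the site.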
Edit
I made a test spider that extracts the name, see below:
import json

import scrapy


class TestSpider(scrapy.Spider):
    name = 'testspider'

    def start_requests(self):
        broker_ids = ['2713535', '1234456', '2134234']
        for broker_id in broker_ids:
            yield scrapy.Request(
                url=f'https://brokercheck.finra.org/individual/summary/{broker_id}',
                meta={
                    'broker_id': broker_id
                }
            )

    def parse(self, response):
        broker_id = response.meta.get('broker_id')
        # Extract whatever content you want from the page here and save it in the broker item,
        # then send another request to get the name.
        broker = {}
        yield scrapy.Request(
            url=f'https://api.brokercheck.finra.org/search/individual/{broker_id}?hl=true&nrows=12&query=&start=0&wt=json',
            callback=self.parse_name,
            meta={
                'broker': broker
            }
        )

    def parse_name(self, response):
        broker = response.meta.get('broker')
        json_data = json.loads(response.body)
        try:
            # 'content' is itself a JSON-encoded string, so it is decoded a second time.
            content = json_data['hits']['hits'][0]['_source']['content']
            content = json.loads(content)
            broker['first_name'] = content['basicInformation']['firstName']
            broker['middle_name'] = content['basicInformation']['middleName']
            broker['last_name'] = content['basicInformation']['lastName']
        except IndexError:
            print('something went wrong, could not find broker with that id')
        print(broker)