Scrapy: info only visible in text after expanding element

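I'm just getting started with scraping and am trying to get brokers' names from http://brokercheck.finra.org. The HTML content I want only becomes visible after expanding some nodes.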

import requests
import scrapy

url = 'https://brokercheck.finra.org/individual/summary/2713535'
r = requests.get(url)
r.text

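I want to get the broker's name, which sits in the element //div[@class="namesummary"]. However, r.text does not reach that deep. Is this a feature of the website, or is there something I can do so that r.text contains all the expanded nodes in the first place?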

When the page is rendered in a browser, a second request is sent to https://api.brokercheck.finra.org/search/individual/2713535?hl=true&nrows=12&query=&start=0&wt=json

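You can find the name in the response to that request; parse the body as JSON. Scrapy does not run the page's JavaScript, so it will not send this dynamic request on its own. You have to request the API URL yourself.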
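As a rough sketch of that idea outside Scrapy, the snippet below requests the JSON endpoint directly and pulls the name out of it. The helper names (build_api_url, extract_name, fetch_name) are made up for illustration, and the response layout (hits -> hits -> [0] -> _source -> content, with content being a JSON-encoded string) is an assumption that may not match what the API returns today.

```python
import json


def build_api_url(broker_id: str) -> str:
    """Build the JSON endpoint URL quoted above for one broker ID."""
    return (
        "https://api.brokercheck.finra.org/search/individual/"
        f"{broker_id}?hl=true&nrows=12&query=&start=0&wt=json"
    )


def extract_name(payload: dict) -> dict:
    """Pull the name fields out of an already-decoded API response.

    Assumed layout: hits -> hits -> [0] -> _source -> content, where
    content is itself a JSON-encoded string. Returns {} if no record exists.
    """
    hits = payload.get("hits", {}).get("hits", [])
    if not hits:
        return {}
    # Second decoding step: the record is stored as a JSON string.
    content = json.loads(hits[0]["_source"]["content"])
    info = content.get("basicInformation", {})
    return {
        "first_name": info.get("firstName"),
        "middle_name": info.get("middleName"),
        "last_name": info.get("lastName"),
    }


def fetch_name(broker_id: str) -> dict:
    """Request the endpoint and extract the name (needs network access)."""
    import requests  # imported lazily; only this function needs it

    response = requests.get(build_api_url(broker_id), timeout=10)
    response.raise_for_status()
    return extract_name(response.json())
```

Calling fetch_name('2713535') should return a dict with the three name fields if the endpoint still responds in this shape, and an empty dict when no record is found.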

Edit

I put together a test spider that extracts the name; see below:

import json
import scrapy

class TestSpider(scrapy.Spider):
    name = 'testspider'

    def start_requests(self):
        broker_ids = ['2713535', '1234456', '2134234']
        for broker_id in broker_ids:
            yield scrapy.Request(
                url=f'https://brokercheck.finra.org/individual/summary/{broker_id}',
                meta={
                    'broker_id': broker_id
                }
            )

    def parse(self, response):
        broker_id = response.meta.get('broker_id')
        # Extract whatever you need from the summary page here and store it in
        # the broker item, then send a second request to fetch the name.
        broker = {}
        yield scrapy.Request(
            url=f'https://api.brokercheck.finra.org/search/individual/{broker_id}?hl=true&nrows=12&query=&start=0&wt=json',
            callback=self.parse_name,
            meta={
                'broker': broker
            }
        )

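    def parse_name(self, response):
        broker = response.meta.get('broker')
        json_data = json.loads(response.text)
        try:
            # 'content' is itself a JSON-encoded string, so decode it a second time
            content = json_data['hits']['hits'][0]['_source']['content']
            content = json.loads(content)
            info = content['basicInformation']
            broker['first_name'] = info['firstName']
            # not every record necessarily has a middle name
            broker['middle_name'] = info.get('middleName')
            broker['last_name'] = info['lastName']
        except (IndexError, KeyError):
            # an empty hit list or an unexpected layout ends up here
            self.logger.warning('something went wrong, could not find broker with that id')
            return

        # yield the item so Scrapy's pipelines and feed exports receive it
        yield broker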