Scrapy Request on form - url gets truncated

Question

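Context - reproducing the problem

Hello, I have been trying to use Scrapy to scrape this web page: https://www.sec.gov/edgar/search.

As you can see, it is a search page with a form to fill in. I want to scrape the page that is returned when the form is filled in like this:

'Document word or phrase': 'MSIGX',
'Filing category': 'All annual, quarterly, and current reports',
'Filed date range': 'Last year'

When you fill in the fields this way, the browser takes you to this link: https://www.sec.gov/edgar/search/#/q=MSIGX&dateRange=1y&category=form-cat1. At first I thought the values in the link were query parameters, but then I noticed the link contains no question mark. If I run the code below, the url of the response passed to the callback (parse) is truncated to just https://www.sec.gov/edgar/search instead of https://www.sec.gov/edgar/search/#/q=MSIGX&dateRange=1y&category=form-cat1.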

import scrapy
from datetime import datetime


class Sec(scrapy.Spider):
    name = 'sec'

    def __init__(self):
        super().__init__()
        # irrelevant setup omitted

    def start_requests(self):
        today = datetime.today()
        year_ago = today.replace(year=today.year - 1)  # not used in this reduced example
        url = 'https://www.sec.gov/edgar/search/#/q=MSIGX&dateRange=1y&category=form-cat1'
        yield scrapy.Request(
            url=url,
            headers=get_sec_header(),  # helper defined elsewhere in my project
            callback=self.parse,
            meta={
                's': row["symbol"],  # `row` comes from code omitted here
                'dont_redirect': True,
                'handle_httpstatus_list': [301, 302],
            },
            dont_filter=True,
        )

    def parse(self, response):
        print(f'HEEEEEEEELLO {response.url}')  # <----- the url is only https://www.sec.gov/edgar/search

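So I cannot scrape the page I want.

Debugging

I debugged this further. When I visit https://www.sec.gov/edgar/search/#/q=MSIGX&dateRange=1y&category=form-cat1 in my web browser, I can see additional network calls being made. I assume some subsequent POST calls load some JS and render the search results.

How can I get Scrapy to reach the final web page?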

Answer 1

To render the results of those JS calls, you need a browser emulator such as Selenium or Splash.
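As background on why response.url looks truncated: everything after the # in that link is a URL fragment, which stays on the client side and is not sent to the server as part of the HTTP request. The search page's JavaScript reads the fragment and then fetches the results itself. A minimal standard-library sketch of that split follows; the variable names are only illustrative, and treating the fragment like a query string is an assumption based on how this particular link is laid out.

```python
from urllib.parse import urldefrag, parse_qs

url = "https://www.sec.gov/edgar/search/#/q=MSIGX&dateRange=1y&category=form-cat1"

# Split the URL into the part a server can see and the client-side fragment.
base, fragment = urldefrag(url)
print(base)      # https://www.sec.gov/edgar/search/
print(fragment)  # /q=MSIGX&dateRange=1y&category=form-cat1

# In this link the fragment is laid out like a query string after a leading "/",
# so it can be parsed the same way to recover the form values.
params = parse_qs(fragment.lstrip("/"))
print(params)    # {'q': ['MSIGX'], 'dateRange': ['1y'], 'category': ['form-cat1']}
```

So a plain HTTP request only ever fetches the empty search page, and something has to execute the page's JavaScript to get the results.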

Splash integrates well with Scrapy. Once you have it running, you can do something like this:

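import scrapy
from scrapy_splash import SplashRequest


class Sec(scrapy.Spider):
    name = 'sec'

    def start_requests(self):
        url = 'https://www.sec.gov/edgar/search/#/q=MSIGX&dateRange=1y&category=form-cat1'
        # render.html returns the page HTML after Splash has run its JavaScript;
        # 'wait' gives the page 1 second to load its results before rendering
        yield SplashRequest(url, callback=self.parse, endpoint='render.html', args={'wait': 1})

    def parse(self, response):
        # Save the rendered HTML so it can be inspected afterwards
        with open('response.html', 'w', encoding='utf-8') as outfile:
            outfile.write(response.text)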

Open that response.html file to see what Splash rendered. You should now be able to see the filing entities.

In settings.py, you also need to add at least the following:

SPLASH_URL = 'http://localhost:8050'  # assuming Splash is running locally on its default port
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
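The scrapy-splash README lists a few more settings beyond that minimum. They are not required for the spider above to run, but they make duplicate filtering and HTTP caching aware of Splash arguments. A sketch of those extra lines for settings.py, using the values I recall from the README (check them against the version you install):

```python
# Optional additions to settings.py, as suggested by the scrapy-splash README.

# Avoids sending repeated Splash arguments in full with every request
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Makes the duplicate filter take Splash arguments into account
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# Only relevant if you enable Scrapy's HTTP cache
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```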

To install Splash, see its documentation.
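The usual route in the Splash docs is Docker, along the lines of `docker run -it -p 8050:8050 --rm scrapinghub/splash`. Before starting the spider it can help to confirm that Splash is actually reachable. Here is a small sketch using only the standard library; the function name `splash_is_up` is made up for this example, and the default address assumes Splash is listening locally on port 8050.

```python
import urllib.error
import urllib.request


def splash_is_up(base_url="http://localhost:8050", timeout=3.0):
    """Return True if a Splash instance answers on its _ping endpoint."""
    try:
        # Splash exposes GET /_ping as a lightweight health check
        with urllib.request.urlopen(f"{base_url}/_ping", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status
        return False


if __name__ == "__main__":
    print("Splash reachable:", splash_is_up())
```

If this prints False, check that the container is running and that SPLASH_URL in settings.py points at the same host and port.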

For more information on scrapy-splash, see its GitHub repo.
