Why does Splash+Scrapy add an HTML header to a JSON response

Question

What am I missing?

I'm trying to scrape some JSON, but I keep getting this HTML header along with the JSON response:

response.data['html'] returns:

2021-02-18 10:35:57 [bcb] DEBUG: b'<html><head></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">{"TotalRows":132,"RowCount":15,"Rows":[{"tit`....

The code is as follows:

    yield scrapy.Request(address_pesquisa, self.parse, meta={
            'splash': {
                'args': {
                    # set rendering arguments here
                    'html': 1,
                    'png': 0,

                },

                # optional parameters
                'endpoint': 'render.json',  # optional; default is render.json
                'splash_url': 'http://192.168.15.100:8050',  # optional; overrides SPLASH_URL
                'slot_policy': scrapy_splash.SlotPolicy.PER_DOMAIN,
                'splash_headers': {},  # optional; a dict with headers sent to Splash
                'dont_process_response': False,  # optional, default is False
                'dont_send_headers': True,  # optional, default is False
                'magic_response': True,  # optional, default is True
            }
        })

Do I have to strip this header myself with a regular expression or something? Or is my Scrapy configuration wrong?

Answer 1

A straightforward way to extract the JSON from the HTML is to use XPath (or CSS selectors). Here's the documentation for Scrapy Selectors.

Something like this in the callback of the scrapy.Request (self.parse):

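import json

# Inside self.parse: take the text of the <pre> element and decode it as JSON.
# The absolute path (leading slash) starts matching at the document root.
json_response = response.xpath('/html/body/pre/text()').get()
json_response = json.loads(json_response)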

Note that I haven't tested the code, so you may need to tweak it a little (in case I mistyped the XPath or something else).
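If you would rather work on the raw string in response.data['html'] than on a selector, the same extraction can be sketched with only the standard library. This is a minimal sketch and has not been run against the real site. The helper name extract_json_from_pre and the sample string are made up for illustration (the sample is a shortened stand-in for the logged response), and the code assumes the JSON sits in the first <pre> element.

```python
import json
from html.parser import HTMLParser


class _PreTextCollector(HTMLParser):
    """Collects the text found inside the first <pre> element."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; back into text
        super().__init__(convert_charrefs=True)
        self._in_pre = False
        self._finished = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre" and not self._finished:
            self._in_pre = True

    def handle_endtag(self, tag):
        if tag == "pre" and self._in_pre:
            self._in_pre = False
            self._finished = True  # ignore any later <pre> elements

    def handle_data(self, data):
        if self._in_pre:
            self.chunks.append(data)


def extract_json_from_pre(html_text):
    """Hypothetical helper: decode the JSON wrapped in <pre>...</pre>."""
    if isinstance(html_text, bytes):
        html_text = html_text.decode("utf-8")
    parser = _PreTextCollector()
    parser.feed(html_text)
    parser.close()
    # Raises json.JSONDecodeError if no <pre> text was found
    return json.loads("".join(parser.chunks))


if __name__ == "__main__":
    # Made-up, shortened sample shaped like the response in the question
    sample = (
        '<html><head></head><body><pre style="word-wrap: break-word;">'
        '{"TotalRows":132,"RowCount":15,"Rows":[]}</pre></body></html>'
    )
    print(extract_json_from_pre(sample)["TotalRows"])
```

In the spider this would be called as extract_json_from_pre(response.data['html']). Unlike a plain regular expression, the parser also unescapes HTML entities that the browser may have added inside the <pre> text.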

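Also, you may want to try downloading the page with curl or the Scrapy shell and check whether the HTML part is still in the response. If it isn't, then using Splash is somehow making the website return a response with HTML.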
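The check described above can also be scripted. The following is a rough sketch using only the standard library. The function names fetch_text and is_bare_json are hypothetical, and the URL you pass in would be the same address_pesquisa used in the spider.

```python
import json
import urllib.request


def fetch_text(url, user_agent=None, timeout=10.0):
    """Download url directly (no Splash) and return the decoded body."""
    headers = {"User-Agent": user_agent} if user_agent else {}
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def is_bare_json(text):
    """True if the whole body parses as JSON, i.e. there is no HTML wrapper."""
    try:
        json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True
```

If is_bare_json(fetch_text(address_pesquisa)) is true, the server itself sends plain JSON and the HTML wrapper is being added somewhere on the Splash side. Passing a browser-like user_agent lets you test whether the server answers differently depending on that header.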

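Update on why the HTML is not in the response when using curl:

One possibility is that the web server returns a different response to a browser than it does to curl. One reason for doing this would be to make the JSON more readable for users in a browser. I mean, it's much easier to read through JSON when it's nicely formatted rather than everything on one line :D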

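So, if that's the case, my guess is that Splash passes some data to the server (i.e. the User-Agent, being able to render JavaScript) that makes the server return a response with HTML.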

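Skipping Splash and making the request with just a Scrapy Request might help (and would also make the crawler a bit faster).

Anyway, if the XPath works (and the small possible speed gain doesn't matter), go with the XPath.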

python

scrapy

scrapy-splash