Trying to determine why my xpath is failing in Scrapy

Question

I am trying to run a Scrapy spider on pages like this one:

https://careers.mitre.org/us/en/job/R104514/Chief-Engineer-Technical-Analysis-Department

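and I would like the spider to retrieve the bullet points that list the qualifications and responsibilities. I can write an XPath expression that selects them, and it works in my browser: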

//*/section/div/ul/li

But when I try the same expression in the Scrapy shell:

response.xpath("//*/section/div/ul/li")

it returns an empty list. When I copy response.text and load it into a browser, the text appears to be there, but I still cannot select those bullet points.

Any help would be appreciated!

Answer 1

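Looking at the page you linked, the list items you are targeting are not actually in the document response itself; they are loaded into the DOM later by JavaScript. Scrapy does not execute JavaScript, so the XPath that works in the browser matches nothing in the raw response.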
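
One way to confirm this kind of diagnosis is to check whether the raw markup contains any real <li> elements at all, or whether the list only exists as text inside a <script> tag. The sketch below does that offline with the standard library only. The sample response, the jobData variable name, and the ResponseInspector helper are invented for illustration and are not taken from the actual page. In the Scrapy shell, a quicker check would be to search response.text or to open the downloaded page with view(response).

```python
from html.parser import HTMLParser

# Invented sample: a server response where the list exists only as escaped
# HTML inside a JavaScript object, not as real <li> elements.
SAMPLE_RESPONSE = """<html><body>
<section><div id="job-details"></div></section>
<script>var jobData = {"description": "&lt;ul&gt;&lt;li&gt;Lead analysis&lt;/li&gt;&lt;/ul&gt;"};</script>
</body></html>"""


class ResponseInspector(HTMLParser):
    """Counts real <li> elements and collects the text inside <script> tags."""

    def __init__(self):
        super().__init__()
        self.li_count = 0
        self.script_chunks = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.li_count += 1
        elif tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script bodies are passed through as raw text, entities included.
        if self._in_script:
            self.script_chunks.append(data)


def inspect_response(html_text):
    inspector = ResponseInspector()
    inspector.feed(html_text)
    inspector.close()
    return inspector.li_count, "".join(inspector.script_chunks)


li_count, script_text = inspect_response(SAMPLE_RESPONSE)
print(li_count)                     # number of <li> elements in the raw markup
print("&lt;li&gt;" in script_text)  # whether the list is hidden in the script
```

If the count is zero while the script text contains the escaped list, the selector is fine and the data simply has to be pulled out of the JavaScript instead.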

To access this content, I recommend looking at the "Selecting dynamically-loaded content" page of the Scrapy documentation. The section that applies here in particular is "Parsing JavaScript code".

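Following the second example there, we can use chompjs (install it with pip first) to extract the JavaScript data, unescape the HTML string, and then load it into a Scrapy selector for parsing. For example: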

scrapy shell https://careers.mitre.org/us/en/job/R104514/Chief-Engineer-Technical-Analysis-Department

Then:

import html    # Used to unescape the HTML stored in JS
import chompjs # Used to parse the JS
javascript = response.css('script::text').get()
data = chompjs.parse_js_object(javascript)
description_html = html.unescape(data['description'])
description = scrapy.Selector(text=description_html, type="html")
description.xpath("//*/ul/li")

This should output the list items you want:

[<Selector xpath='//*/ul/li' data='<li>Ensure the strength ...
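
The same three steps (extract the JavaScript object, unescape the HTML, parse the list items) can be tried without network access or third-party packages. The following is a minimal sketch under stated assumptions, not the method used above: it swaps chompjs for json.loads, which only works when the embedded object happens to be valid JSON, and swaps the Scrapy selector for the standard library's HTMLParser. The script text, the jobData name, and the helper names are invented for illustration.

```python
import html
import json
import re
from html.parser import HTMLParser

# Invented stand-in for the body of the <script> tag that holds the job data.
SCRIPT_TEXT = (
    'var jobData = {"title": "Chief Engineer", "description": '
    '"&lt;ul&gt;&lt;li&gt;Lead technical analysis&lt;/li&gt;'
    '&lt;li&gt;Mentor staff&lt;/li&gt;&lt;/ul&gt;"};'
)


class ListItemCollector(HTMLParser):
    """Collects the text of every <li> element it sees."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._in_li = False

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._in_li = True
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_li = False

    def handle_data(self, data):
        if self._in_li:
            self.items[-1] += data


def extract_list_items(script_text):
    # Step 1: pull the object literal out of the JavaScript source.
    # Assumption: it is valid JSON; chompjs tolerates looser JS syntax.
    match = re.search(r"\{.*\}", script_text, re.DOTALL)
    if match is None:
        return []
    data = json.loads(match.group(0))
    # Step 2: the description is stored with HTML entities, so unescape it.
    description_html = html.unescape(data["description"])
    # Step 3: parse the recovered markup and collect the <li> text.
    collector = ListItemCollector()
    collector.feed(description_html)
    collector.close()
    return collector.items


print(extract_list_items(SCRIPT_TEXT))
```

On a real page the object literal is often not strict JSON (unquoted keys, trailing commas), which is where chompjs earns its place.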
