How do I scrape JS pages with Splash?

Question

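I am trying to scrape this link, but without success: I get no errors, yet my values come back empty.

I am using Python with Scrapy and Splash. What is wrong? Can anyone help?

Here is my spider code: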

  # -*- coding: utf-8 -*-
  import scrapy
  from scrapy_splash import SplashRequest
  from boom.items import BoomItem
  from scrapy.selector import HtmlXPathSelector


  class OrumcekSpider(scrapy.Spider):
      name = 'orumcek'
      start_urls = ['example.com']

      def start_requests(self):
          for url in self.start_urls:
              yield SplashRequest(url=url, callback=self.parse, endpoint='render.html')

      def parse(self, response):
          item = BoomItem()
          item["BrandName"] = response.xpath("//*[@id='data-item']/div/a/span/text()").extract()
          item["BrandSector"] = response.xpath("//*[@id='data-item']/div[3]/span/text()").extract()

          return item

Answer 1

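I could not find any element on the page with an id equal to data-item, either in the page source or when inspecting it. There are, however, elements that have a data-item attribute. So the Splash rendering is probably fine, and you only need to change your XPath to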

item["..."] = response.xpath("//*[@data-item]/...")
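To illustrate why the original selector matched nothing, here is a minimal sketch that compares the two predicates. The markup is hypothetical (the real page is not shown in the question), and it uses the standard library's ElementTree rather than Scrapy so that it runs without extra dependencies. In the spider, the same attribute test would be written as in the line above.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: elements carry a data-item attribute, but no id="data-item".
SAMPLE = """
<html><body>
  <div data-item="1"><div><a><span>Acme</span></a></div></div>
  <div data-item="2"><div><a><span>Globex</span></a></div></div>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)

# Matches only elements whose id attribute equals "data-item": none exist here.
by_id = root.findall(".//*[@id='data-item']")

# Matches every element that has a data-item attribute, whatever its value.
by_attr = root.findall(".//*[@data-item]")

# Read the brand name from the span nested under each matched element.
names = [el.findtext("div/a/span") for el in by_attr]
print(len(by_id), names)
```

An empty result from the id-based query with no error raised is consistent with the blank values described in the question.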

Answer 2

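You do have data-item, but it is an attribute, not an id. The picture shows how to copy the selector or XPath.

This page takes time to render, so you should wait until the element you want is found.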

  while not splash:select('.your-element') do
    splash:wait(0.1)
  end
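One way to run a loop like this from the spider is to send a Lua script to Splash's execute endpoint instead of using render.html. The sketch below is an assumption about how that could be wired up, not the answerer's exact code: splash_kwargs, the css and max_tries arguments, and the retry limit are hypothetical additions, the last one so the loop cannot spin forever if the element never appears.

```python
# Lua script for Splash's "execute" endpoint. It polls for a CSS selector
# instead of sleeping for a fixed time, and gives up after max_tries polls.
LUA_WAIT_FOR_ELEMENT = """
function main(splash, args)
  assert(splash:go(args.url))
  local tries = 0
  while not splash:select(args.css) and tries < args.max_tries do
    splash:wait(0.1)
    tries = tries + 1
  end
  return {html = splash:html()}
end
"""


def splash_kwargs(css, max_tries=100):
    """Build keyword arguments for SplashRequest (hypothetical helper)."""
    return {
        "endpoint": "execute",
        "args": {
            "lua_source": LUA_WAIT_FOR_ELEMENT,
            "css": css,              # selector to wait for, read as args.css in Lua
            "max_tries": max_tries,  # upper bound on 0.1 s polls
        },
    }


kwargs = splash_kwargs(".your-element")
print(kwargs["endpoint"], kwargs["args"]["css"])
```

In start_requests, the request would then be built as SplashRequest(url=url, callback=self.parse, **splash_kwargs('.your-element')), with '.your-element' replaced by a selector that actually exists on the rendered page.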

python

splash-screen

scrapy