Scrapy Splash not respecting rendering "wait" time

Question

I am scraping this page with Scrapy and Splash: https://www.athleteshop.nl/shimano-voor-as-108mm-37184

This is what I get in the Scrapy shell with view(response): (image: scrapy shell img)

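I need the barcode highlighted in red. It is generated by JavaScript, as can be seen in the source in Chrome with F12. However, although it is displayed correctly both in the Scrapy shell and on the Splash localhost page, and although the Splash localhost page gives me the correct HTML, the barcode I want to select is always None with response.xpath("//table[@class='data-table']//tr[@class='even']/td[@class='data last']/text()").extract_first().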

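The selector is not the problem, since it works on the source in Chrome. I have been looking for an answer on the web and on SO for two days, and nobody seems to have had the same problem. Is it just that Splash does not support this? The settings are the classic ones, as follows: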

    SPLASH_URL = 'http://192.168.99.100:8050/'
    DOWNLOADER_MIDDLEWARES = {
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    }
    SPIDER_MIDDLEWARES = {
        'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
    }
    DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
    HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

My code is as follows (the parse part follows the link provided by the site's internal search engine and works fine):

    def parse(self, response):
        try:
            # Follow the first result returned by the site's internal search.
            link = response.xpath("//li[@class='item last']/a/@href").extract_first()
            yield SplashRequest(link, self.parse_item, endpoint='render.html', args={'wait': 20})
        except Exception as e:
            print(str(e))

    def parse_item(self, response):
        product = {}
        product['name'] = response.xpath("//div[@class='product-name']/h1/text()").extract_first()
        # This is the field that comes back as None.
        product['ean'] = response.xpath("//table[@class='data-table']//tr[@class='even']/td[@class='data last']/text()").extract_first()
        product['price'] = response.xpath("//div[@class='product-shop']//p[@class='special-price']/span[@class='price']/text()").extract_first()
        product['image'] = response.xpath("//div[@class='item image-photo']//img[@class='owl-product-image']/@src").extract_first()
        print(product['name'])
        print(product['ean'])
        print(product['image'])

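Printing the name and the image URL works fine, since they are not generated by JavaScript. The code is fine, the settings are fine, and the Splash localhost page shows me the right thing, but my selector does not work when the script runs (no error is shown), nor in the Scrapy shell.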

The problem may be that Scrapy Splash renders immediately without respecting the wait time (20 seconds!). What am I doing wrong?
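One way to test this hypothesis is to ask Splash's render.html endpoint directly for the page with the same wait value, then compare the returned HTML with what the spider receives. Below is a minimal sketch of building that URL with only the standard library. It assumes the Splash address from the settings above, and build_render_url is a hypothetical helper name, not part of scrapy_splash.

```python
from urllib.parse import urlencode, urljoin

# Assumption: the same Splash address as SPLASH_URL in the settings above.
SPLASH_URL = "http://192.168.99.100:8050/"


def build_render_url(page_url, wait=20):
    """Return a direct render.html URL for page_url (hypothetical helper)."""
    # render.html takes the target page as `url` and the delay as `wait`.
    query = urlencode({"url": page_url, "wait": wait})
    return urljoin(SPLASH_URL, "render.html") + "?" + query


print(build_render_url("https://www.athleteshop.nl/shimano-voor-as-108mm-37184"))
```

Opening the printed URL in a browser, or fetching it with any HTTP client, shows the HTML Splash produces after waiting, independent of the spider code.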

Thanks in advance.

Answer 1

It looks to me like the content of the barcode field is not generated dynamically: I can see it in the page source and extract it from the Scrapy shell with response.css('.data-table tbody tr:nth-child(2) td:nth-child(2)::text').extract_first().
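A note on why a position-based selector might succeed where the class-based XPath from the question returns None. One possible explanation, which is an assumption and not something confirmed in this thread, is that classes such as even and data last are added by a script after the page loads, so they show up in Chrome's F12 view but not in the HTML the server sends. The sketch below illustrates the idea on hypothetical, simplified markup with a placeholder barcode value. It uses only the standard library so it runs without Scrapy; in a spider the same lookups would go through response.xpath or response.css.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: the table as served, before any script
# has added row classes such as "even" or cell classes such as "data last".
# The barcode value is a placeholder, not taken from the real page.
RAW_HTML = """
<div>
  <table class="data-table">
    <tbody>
      <tr><th>Brand</th><td>Shimano</td></tr>
      <tr><th>Barcode</th><td>0000000000000</td></tr>
    </tbody>
  </table>
</div>
"""

root = ET.fromstring(RAW_HTML)

# Class-based lookup, mirroring the XPath in the question.
# It finds nothing because the classes are absent from this markup.
by_class = root.find(
    ".//table[@class='data-table']//tr[@class='even']/td[@class='data last']"
)

# Position-based lookup, mirroring the CSS selector in this answer:
# second row of the table, second cell of that row.
table = root.find(".//table[@class='data-table']")
second_row = table.findall(".//tr")[1]
by_position = list(second_row)[1]

print(by_class)
print(by_position.text)
```

If this is what is happening on the real page, the positional selector does not depend on JavaScript having run at all, which would explain why it works from a plain Scrapy shell.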

