Running Scrapy spider as script doesn't get all the code, but scrapy spider from project does

Question

I have a simple spider that scrapes some data from a script tag on a page.

I grab the script like this:

jsData = json.loads(response.xpath('//script[@type="application/ld+json"]//text()').extract_first())

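When I run this from my spider inside a project, I get all the data. If I run it from a regular script, outside a project, it doesn't get everything in the script tag. Why is that?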

Here is my script spider:

import scrapy
import json
from scrapy.crawler import CrawlerProcess


class MySpider(scrapy.Spider):
    name = "target"
    start_urls = ['https://www.target.com/p/madden-nfl-22-xbox-one-series-x/-/A-83744898#lnk=sametab']

    def parse(self, response):
        jsData = json.loads(response.xpath('//script[@type="application/ld+json"]//text()').extract_first())
        NAME_SELECTOR = jsData['@graph'][0]

        yield {
            'name': NAME_SELECTOR,
        }


process = CrawlerProcess()

process.crawl(MySpider)
process.start()

It gives me:

...'offers': {'@type': 'Offer', 'priceCurrency': 'USD', 'availability': 'InStock', 'availableDeliveryMethod': 'ParcelService', 'potentialAction': {'@type': 'BuyAction'}, 'url': 'https://www.target.com/p/madden-nfl-22-xbox-one-series-x/-/A-83744898'}}}

My project spider code is:

import scrapy
import json

class targetSpider(scrapy.Spider):
    name = "target"
    start_urls = ['https://www.target.com/p/madden-nfl-22-xbox-one-series-x/-/A-83744898#lnk=sametab']

    def parse(self, response):
        jsData = json.loads(response.xpath('//script[@type="application/ld+json"]//text()').extract_first())
        test = jsData['@graph'][0]

        yield {
            'test': test
        }

It gives me:

...'offers': {'@type': 'Offer', 'price': '59.99', 'priceCurrency': 'USD', 'availability': 'PreOrder', 'availableDeliveryMethod': 'ParcelService', 'potentialAction': {'@type': 'BuyAction'}, 'url': 'https://www.target.com/p/madden-nfl-22-xbox-one-series-x/-/A-83744898'}}}

Answer 1

This is about JavaScript. Content such as 'price': '59.99' is loaded by JavaScript, and the Downloader in Scrapy does not run JavaScript by default.
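One way to check this is to look at the HTML exactly as Scrapy received it (for example `response.text`) and see whether the field is already present in the JSON-LD block. Below is a minimal sketch using only the standard library. The names `LdJsonCollector` and `ld_json_blocks` and the sample HTML are made up for illustration.

```python
import json
from html.parser import HTMLParser


class LdJsonCollector(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self._chunks = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._inside = True
            self._chunks = []

    def handle_data(self, data):
        if self._inside:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._inside:
            self.blocks.append("".join(self._chunks))
            self._inside = False


def ld_json_blocks(html):
    """Return the parsed JSON-LD objects found in raw (unrendered) HTML."""
    collector = LdJsonCollector()
    collector.feed(html)
    collector.close()
    return [json.loads(block) for block in collector.blocks]


# Made-up sample standing in for response.text
sample = (
    '<html><head><script type="application/ld+json">'
    '{"@graph": [{"offers": {"priceCurrency": "USD"}}]}'
    "</script></head></html>"
)
offers = ld_json_blocks(sample)[0]["@graph"][0]["offers"]
print("price" in offers)  # prints False for this sample
```

If the field is missing from the raw HTML in one run but present in the other, the two runs did not receive the same page, which points to a difference in how the request was made or processed rather than in the XPath.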

Possible causes of your problem:

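The settings.py used by one of your spiders enables an external downloader middleware (e.g. Selenium, Splash, Playwright), while the other does not.
The script that starts the crawler with CrawlerProcess() is not run from the project root, so settings.py fails to load.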
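To check the second cause, a small helper can look for `scrapy.cfg`, the file that marks a Scrapy project root, starting from the current working directory and walking up through its parents. This is a sketch using only the standard library. `find_scrapy_cfg` is a hypothetical helper name, not part of Scrapy.

```python
from pathlib import Path


def find_scrapy_cfg(start="."):
    """Return the nearest scrapy.cfg at or above `start`, or None if there is none."""
    current = Path(start).resolve()
    for directory in (current, *current.parents):
        candidate = directory / "scrapy.cfg"
        if candidate.is_file():
            return candidate
    return None


cfg = find_scrapy_cfg()
if cfg is None:
    print("No scrapy.cfg found: the script is not running inside a Scrapy project")
else:
    print(f"Project root: {cfg.parent}")
```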

Update: Sorry, I forgot that we need to load the settings manually when using CrawlerProcess(). See "Run Scrapy from a script" in the Scrapy documentation.
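Loading the settings manually, as the update describes, can be sketched as follows. It assumes the script is run from inside the project directory (so Scrapy can locate `scrapy.cfg`) and that the project contains a spider named `target`. `run_spider` is a hypothetical wrapper name.

```python
def run_spider(spider_name="target"):
    """Run a project spider from a plain script, using the project's settings.py."""
    # Imported inside the function so that importing this module does not
    # itself require Scrapy to be installed.
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    # get_project_settings() returns the project's settings; a bare
    # CrawlerProcess() would run with Scrapy's defaults instead.
    process = CrawlerProcess(get_project_settings())
    process.crawl(spider_name)  # the spider is looked up by name in the project
    process.start()  # blocks until the crawl finishes
```

Calling `run_spider()` at the bottom of the script starts the crawl. With the project settings loaded, the script run should go through the same middlewares and options as `scrapy crawl target`.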
