Bypass popup using Scrapy (yummy ice-cream)

I'm trying to scrape some ice-cream-related data from the site https://threetwinsicecream.com/products/ice-cream/. It looks like a very simple website, but I can't get my spider to work because of (I think) a JavaScript popup that blocks my access. I've attached a stripped-down version of my scrapy code below:

import scrapy


class NutritionSpider(scrapy.Spider):
    name = 'nutrition'
    allowed_domains = ['threetwinsicecream.com']
    start_urls = ['http://threetwinsicecream.com/']

    def parse(self, response):
        products = response.xpath("//div[@id='pints']/div[2]/div")
        for product in products:
            name = product.xpath(".//a/p/text()").extract_first()
            link = product.xpath(".//a/@href").extract_first()

            yield scrapy.Request(
                url=link,
                callback=self.parse_products,
                meta={
                    "name": name,
                    "link": link
                }
            )

    def parse_products(self, response):
        name = response.meta["name"]
        link = response.meta["link"]

        serving_size = response.xpath("//div[@id='nutritionFacts']/ul/li[1]/text()").extract_first() 

        calories = response.xpath("//div[@id='nutritionFacts']/ul/li[2]/span/text()").extract_first()

        yield {
            "Name": name,
            "Link": link,
            "Serving Size": serving_size,
            "Calories": calories
        }

I came up with a workaround, but it requires manually writing out all the links to the various ice cream varieties, as shown below. I also tried disabling JavaScript on the site, but that didn't seem to work either.

    def parse(self, response):
        urls = [
            "https://threetwinsicecream.com/products/ice-cream/madagascar-vanilla/",
            "https://threetwinsicecream.com/products/ice-cream/sea-salted-caramel/",
            ...
        ]

        for url in urls:
            yield scrapy.Request(
                url=url,
                callback=self.parse_products
            )

    def parse_products(self, response):
        pass

Is there a way to bypass the popup using scrapy, or do I have to use another tool such as selenium? Thanks for your help!

The spider you posted works, at least on my machine. The only thing I had to change was start_urls = ['http://threetwinsicecream.com/'] to start_urls = ['https://threetwinsicecream.com/products/ice-cream/'].
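
For clarity, a minimal sketch of that single change, with the rest of the spider kept exactly as you posted it (the comment explaining why it helps is my reading of the situation, not something guaranteed by the site):

import scrapy

class NutritionSpider(scrapy.Spider):
    name = 'nutrition'
    allowed_domains = ['threetwinsicecream.com']
    # Start directly on the product listing page instead of the homepage,
    # so the //div[@id='pints']/div[2]/div XPath in parse() has the
    # product grid to match against.
    start_urls = ['https://threetwinsicecream.com/products/ice-cream/']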

If you run into these kinds of problems, you can use Scrapy's open_in_browser function, which lets you see in your browser exactly what Scrapy sees. It's documented here.
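
In case it helps, here is a minimal sketch of how open_in_browser can be dropped into a callback while debugging (remove the call once you've confirmed what the spider is actually receiving):

from scrapy.utils.response import open_in_browser

def parse(self, response):
    # Writes the body Scrapy actually received to a temporary HTML file and
    # opens it in your default browser, so you can check whether the popup
    # (or any JavaScript-only content) is even present in the raw response.
    open_in_browser(response)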