Is there a way to specify max crawl depth when using the Apify SDK?

Question

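I'm working on a project for which I'm evaluating Scrapy and Apify. Most of our code is centered on Node.js, so a JavaScript solution would be nice. I also like that Apify uses Puppeteer. That said, my use case requires fairly shallow crawls (a depth of about 4, for example) of many websites. This is easy to configure in Scrapy, but I can't figure out how to do it in Apify. Is there a way to specify a max depth in the new Apify API? It looks like this was a parameter in their legacy crawler, but I haven't found it in the new API.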

Answer 1

You can find the "Max crawling depth" option in apify/web-scraper. This tool is the replacement for the legacy PhantomJS scraper. It uses Puppeteer and has a very similar interface.

You can also use the Apify SDK and implement max depth on your own with PuppeteerCrawler. I recommend using request.userData to keep track of how deep you are in the crawl. If you are interested in this approach, you can check the source code of web-scraper to see how it is done there.

Answer 2

There are two ways to approach this. First, you can use the Puppeteer Scraper public actor, which lets you use most of the Apify SDK's features in a simplified form; the max crawl depth setting is available there as a simple input under the Performance and limits section. To learn the basics, visit the introduction tutorial.

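The second way is more involved: use the Apify SDK directly. Every request has a request.userData property that you can use to pass arbitrary user data down the crawl. That way, before adding more pages to the request queue, you can check that the desired depth has not been reached yet: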

const MAX_DEPTH = 4;

// When creating the request queue, we seed the first request with a depth of 0.
const requestQueue = await Apify.openRequestQueue();
await requestQueue.addRequest({
 url: "https://stackoverflow.com",
 userData: {
   depth: 0,
 }
});

// ...

// Then, somewhere in handlePageFunction, when adding more requests to the queue.
if (request.userData.depth < MAX_DEPTH) {
  await requestQueue.addRequest({
    url: "https://example.com",
    userData: {
      depth: request.userData.depth + 1,
    },
  });
}
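
To see the depth bookkeeping in isolation, here is a minimal sketch that runs without the Apify SDK. It is only an illustration: a plain array stands in for the request queue, and buildChildRequests, crawl, and the links map are hypothetical names with made-up data, not SDK functions. The request objects keep the same { url, userData: { depth } } shape as above, so the helper could probably be reused inside a handlePageFunction.

```javascript
// Hypothetical helper (not part of the Apify SDK): builds the request objects
// for links found on a page, or returns nothing once the limit is reached.
function buildChildRequests(parentRequest, urls, maxDepth) {
  // Treat a request without userData.depth as a start request at depth 0.
  const depth = (parentRequest.userData && parentRequest.userData.depth) || 0;
  if (depth >= maxDepth) return [];
  return urls.map((url) => ({ url, userData: { depth: depth + 1 } }));
}

// Stand-in for a real crawl: a plain array plays the role of the request queue,
// and `links` maps each made-up URL to the URLs "found" on that page.
function crawl(startUrl, links, maxDepth) {
  const queue = [{ url: startUrl, userData: { depth: 0 } }];
  const seen = new Set([startUrl]);
  const visited = [];
  while (queue.length > 0) {
    const request = queue.shift();
    visited.push(`${request.url}@${request.userData.depth}`);
    const found = links[request.url] || [];
    for (const child of buildChildRequests(request, found, maxDepth)) {
      // Skip URLs that were already enqueued.
      if (!seen.has(child.url)) {
        seen.add(child.url);
        queue.push(child);
      }
    }
  }
  return visited;
}

// A chain of four pages: a -> b -> c -> d.
const links = { a: ["b"], b: ["c"], c: ["d"], d: [] };
console.log(crawl("a", links, 2)); // a, b and c are visited; d lies beyond depth 2
```

With this check, pages at exactly the maximum depth are still visited, but their links are not followed, which matches the `depth < MAX_DEPTH` condition in the snippet above.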


web-crawler

apify