Nodejs/nestjs: Getting a 13s response time from my multiple crawlers

Question

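I'm building a newspaper app similar to the Flipboard Briefing mobile app, using the Node.js NestJS framework.

I crawl multiple websites to get the data, and I end up with an array of more than 60 items just from the first page of each website combined. The response time is between 10 and 15 seconds for only 3 websites, which is unacceptable.

I searched around and found that NestJS provides a caching service. With caching, responses come back in about 20 ms, which is great, but:

I'm not using any kind of DB, since I'm not scraping full content, just titles and URLs for iframes.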

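My questions are:

How do I paginate at 60 items per page, so that reaching the end of a page triggers a new request from my crawlers for the next page?
Every 6 hours (when my cache expires), the first user faces the 15 s response time. How can I make the server refresh the cache automatically, without waiting for a request?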

Crawler code (I have 3 functions like this, almost identical; only the CSS selectors change):

async getArticlesFromTechWD(page: number) {
    const html = await get('https://www.tech-wd.com/wd/category/news/page/' + page);

    // Cheerio
    let $ = load(html);

    // Converts a date such as "<day> <Arabic month name>، <year>" to "YYYY-MM-DD".
    function formatingDate(date: string): string {
        const months = ["يناير", "فبراير", "مارس", "إبريل", "مايو", "يونيو",
            "يوليو", "أغسطس", "سبتمبر", "أكتوبر", "نوفمبر", "ديسمبر"
        ];

        const parts = date.replace('،', '').split(' ');
        const year = parts[2];
        // Zero-pad month and day so the resulting strings sort chronologically.
        const month = String(months.indexOf(parts[1]) + 1).padStart(2, '0');
        const day = parts[0].padStart(2, '0');

        return `${year}-${month}-${day}`;
    }

    const articles = $('#masonry-grid .post-element').map(function () {
        return {
            title: $('.thumb-title', this).text().trim(),
            date: formatingDate($('.date', this).text().trim()),
            url: $('.thumb-title a', this).attr('href'),
            image: $('.slide', this).css('background-image').replace('url(', '').replace(')', '').replace(/\"/gi, ""),
            src: 'www.tech-wd.com'
        }
    }).get();

    return articles;
}

Combining all crawler data into one array:

async getAllArticles(page: number, size: number) {

    const skip = size * (page - 1);

    // First crawler (has an optional page param, default is page 1)
    const unlimitTech = await this.getArticlesFromUnlimitTech();

    // Second crawler (has an optional page param, default is page 1)
    const tectWd = await this.getArticlesFromTechWD();

    // Merge them; intended to be sorted by date (DESC), but this currently shuffles randomly
    const all = unlimitTech.concat(tectWd).sort(() => Math.random() - 0.5);

    return all;

}

Answer 1

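The trick is to do multiple things at once. Start all of your requests, then await each of them at the end. Your timings, at least, sound like you are waiting for each request to finish before starting the next one.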
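
To make the difference concrete, here is a minimal sketch that compares the two approaches. It does not use the real crawlers: `fakeCrawler`, the site names, and the 100 ms delay are all hypothetical stand-ins for a slow network request, and `timed` is just a helper added for measurement.

```typescript
// Hypothetical stand-in for a real crawler: resolves after `ms` milliseconds.
function fakeCrawler(name: string, ms: number): Promise<string[]> {
  return new Promise((resolve) => {
    setTimeout(() => resolve([`${name}-article`]), ms);
  });
}

// Sequential: the second request only starts once the first has finished.
async function crawlSequentially(): Promise<string[]> {
  const a = await fakeCrawler('siteA', 100);
  const b = await fakeCrawler('siteB', 100);
  return a.concat(b);
}

// Concurrent: both requests are started first, then each one is awaited.
async function crawlConcurrently(): Promise<string[]> {
  const pendingA = fakeCrawler('siteA', 100);
  const pendingB = fakeCrawler('siteB', 100);
  const a = await pendingA;
  const b = await pendingB;
  return a.concat(b);
}

// Helper that measures how long an async function takes to complete.
async function timed<T>(fn: () => Promise<T>): Promise<{ result: T; elapsedMs: number }> {
  const start = Date.now();
  const result = await fn();
  return { result, elapsedMs: Date.now() - start };
}
```

With these delays, `timed(crawlSequentially)` should take roughly the sum of the two delays, while `timed(crawlConcurrently)` should take roughly the longer of the two. One caveat with awaiting the pending promises one by one: if a later promise rejects while an earlier one is still being awaited, the rejection may be reported as unhandled, so in real code `Promise.all` is usually the safer way to wait for all of them.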

Answer 2

Instead of one at a time:

const unlimitTech = await this.getArticlesFromUnlimitTech();
const tectWd = await this.getArticlesFromTechWD();

You can run them concurrently:

const [unlimitTech, tectWd] = await Promise.all([
  this.getArticlesFromUnlimitTech(),
  this.getArticlesFromTechWD()
])
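
As a rough sketch of how this could fit together with the `page` and `size` parameters from the question, the merged array can be sorted by date and then sliced. Everything below is illustrative: the `Article` shape, `crawlSiteA`, `crawlSiteB`, and their canned data are hypothetical, and the sort assumes dates are zero-padded `YYYY-MM-DD` strings so that string order matches chronological order.

```typescript
// Assumed article shape; dates are zero-padded 'YYYY-MM-DD' strings.
interface Article {
  title: string;
  date: string;
  url: string;
}

// Hypothetical crawlers returning canned data instead of fetching real pages.
async function crawlSiteA(): Promise<Article[]> {
  return [
    { title: 'A1', date: '2020-01-03', url: 'https://a.example/1' },
    { title: 'A2', date: '2020-01-01', url: 'https://a.example/2' },
  ];
}

async function crawlSiteB(): Promise<Article[]> {
  return [{ title: 'B1', date: '2020-01-02', url: 'https://b.example/1' }];
}

async function getAllArticles(page: number, size: number): Promise<Article[]> {
  const skip = size * (page - 1);

  // Start both crawlers at once and wait for both to finish.
  const [siteA, siteB] = await Promise.all([crawlSiteA(), crawlSiteB()]);

  // Newest first: ISO-style date strings compare correctly as plain strings.
  const all = siteA.concat(siteB).sort((x, y) => y.date.localeCompare(x.date));

  // Return only the requested page of the merged list.
  return all.slice(skip, skip + size);
}
```

Calling `getAllArticles(1, 2)` here returns the two newest items, and `getAllArticles(2, 2)` returns the remaining one. This only pages through what has already been crawled; fetching further pages from each source site would still need the crawlers' own `page` parameter.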

