How to run a crawler with ASP.NET Core: service interaction and lifetime question

Question

I developed a web crawler as an ASP.NET Core 3.1 app.

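Following good practices I gathered on the Internet, I split it into a downloader service, responsible for requesting webpages and storing them in the database once downloaded, and a webpage consolidation service, responsible for taking the raw HTML pages and consolidating them into useful data.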

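Since the pages to crawl depend on previously crawled pages, communication between the two services goes both ways through the database, which keeps the services well decoupled: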

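The webpage consolidation service writes the URLs to crawl into the DB and collects the downloaded HTML
The downloader service collects URLs from the DB and fills in the HTML once the download completes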

I am facing several technical challenges, and for various reasons I feel my design is not optimal.

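1. Both the downloader and the webpage consolidation service are registered as singletons in the service container. The underlying reason is that, although the whole application is designed as an API (start the crawler, stop the crawler, get some crawled data), these services run in the background and live much longer than an API request or even a session. I know the singleton pattern can cause problems, but I have no better idea for running the crawler. What problems should I expect, and is there a more suitable way to design these services?
2. To achieve continuous operation, both services are started as non-awaited async operations and run an infinite loop that queries the database. The main problem I face with this design is that if any exception occurs during the process, for example a failed download, the exception no longer bubbles up to a calling method (there is none), and the crawler may be left in an undetermined state.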

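I believe there are many bad things about the way the application is designed; please be indulgent and point me to the right resources, if any. I am not sure whether this post complies with the forum rules (too broad a question?); please delete it if it does not.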

Here is a simplified version of the crawler:

public class APIController : Controller //API Controller starting and stopping the crawler
{
    private DownloaderService _downloaderService;
    private WebpageConsolidationService _consolidationService;

    public APIController(DownloaderService downloaderService, WebpageConsolidationService consolidationService) {
        _downloaderService = downloaderService;
        _consolidationService = consolidationService;
    }


    public IActionResult StartCrawler() {
        if (!_downloaderService.DownloaderStarted) {
            Task t1 = _downloaderService.StartDownloaderAsync(); //non awaited task
        }
        if (!_consolidationService.ConsolidationStarted) {
            Task t2 = _consolidationService.StartWebpageConsolidationAsync(); //non awaited task
        }
        return Ok();
    }

    public IActionResult StopCrawler() {
        if (_downloaderService.DownloaderStarted) {
            _downloaderService.DownloaderStarted = false;
        }
        if (_consolidationService.ConsolidationStarted) {
            _consolidationService.ConsolidationStarted = false;
        }
        return Ok();
    }

}


public class DownloaderService //Singleton
{
    private ApplicationDbContext _context;
    private readonly IServiceScopeFactory scopeFactory;

    public DownloaderService(ApplicationDbContext context, IServiceScopeFactory scopeFactory)
    {
        _context = context;
        this.scopeFactory = scopeFactory;
    }

    public bool DownloaderStarted { get; set; }

    public async Task StartDownloaderAsync()
    {
        DownloaderStarted = true;
        while (DownloaderStarted)
        {
            using (var scope = scopeFactory.CreateScope()) {
                var context = scope.ServiceProvider.GetRequiredService<ApplicationDbContext>();

                string url = context.Webpages.FirstOrDefault(x => x.Downloaded == false)?.Url;
                if(url==null) continue;

                //Download the webpage here
                Webpage webpage = await DowloadWebpageAsync(url);
                webpage.Downloaded = true;
                context.Webpages.Add(webpage);
                await context.SaveChangesAsync();

                if (context.Webpages.Any(x => x.Downloaded == false)) await Task.Delay(10000); //in case there is no more webpage to crawl now
            }
        }
    }
}

public class WebpageConsolidationService //Singleton
{
    private ApplicationDbContext _context;
    private readonly IServiceScopeFactory scopeFactory;

    public bool ConsolidationStarted { get; set; }

    public WebpageConsolidationService(ApplicationDbContext context, IServiceScopeFactory scopeFactory) {
        _context = context;
        this.scopeFactory = scopeFactory;
    }

    public async Task StartWebpageConsolidationAsync()
    {
        ConsolidationStarted = true;
        while (ConsolidationStarted) {
            using (var scope = scopeFactory.CreateScope()) {
                var context = scope.ServiceProvider.GetRequiredService<ApplicationDbContext>();

                Webpage toBeProcessed = context.Webpages.FirstOrDefault(x => x.Processed == false && x.Downloaded == true);
                if (toBeProcessed == null) continue;

                //Consolidate the webpage here
                Webpage[] otherWebpages = await ProcessWebpage(toBeProcessed);
                context.Webpages.AddRange(otherWebpages);

                await context.SaveChangesAsync();

                if (context.Webpages.Any(x => x.Processed == false && x.Downloaded==true)) await Task.Delay(10000); //in case there is no more webpage to crawl now
            }
        }
    }
}

Answer 1

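A crawler should run as a generic host rather than a web host (an ASP.NET Core application). In other words, your downloader service and your webpage consolidation service should be two separate .NET Core applications, and they should communicate through a message queue or some other inter-process communication mechanism.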
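As a rough illustration of the generic host idea, here is a minimal sketch of one of the two processes. It assumes the Microsoft.Extensions.Hosting package; `DownloaderWorker` and `DownloadNextAsync` are hypothetical names, and the actual download and message queue code are left as a placeholder.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;

// Hypothetical worker: the host starts it on startup and signals
// stoppingToken on shutdown (Ctrl+C, SIGTERM), so no flag or API call is needed.
public class DownloaderWorker : BackgroundService
{
    private readonly ILogger<DownloaderWorker> _logger;

    public DownloaderWorker(ILogger<DownloaderWorker> logger) => _logger = logger;

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            try
            {
                await DownloadNextAsync(stoppingToken);
                // Pause between iterations instead of spinning on the data store.
                await Task.Delay(TimeSpan.FromSeconds(10), stoppingToken);
            }
            catch (OperationCanceledException) when (stoppingToken.IsCancellationRequested)
            {
                break; // normal shutdown
            }
            catch (Exception ex)
            {
                // A failed iteration is logged and the loop keeps running.
                _logger.LogError(ex, "Download iteration failed");
            }
        }
    }

    // Placeholder (assumption): fetch the next URL from the queue or database,
    // download it, and store the result.
    protected virtual Task DownloadNextAsync(CancellationToken ct) => Task.CompletedTask;
}

public static class Program
{
    public static Task Main(string[] args) =>
        Host.CreateDefaultBuilder(args)
            .ConfigureServices((context, services) =>
                services.AddHostedService<DownloaderWorker>())
            .Build()
            .RunAsync();
}
```

The consolidation service would be a second application of the same shape. Because the host owns the worker's lifetime, exceptions are observed inside `ExecuteAsync` rather than lost in a non-awaited task.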

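As for the 2nd question you mentioned, each download should be a separate thread or task, and the threads or tasks should be taken from a pool so that you do not consume too much memory. When the downloader throws an exception, write it to the log.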
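One possible way to get "one task per download, with a bounded pool" is to gate the tasks with a `SemaphoreSlim`. The sketch below uses only the base class library; `BoundedDownloader`, `DownloadAllAsync`, and the `downloadAsync` and `onError` delegates are hypothetical names chosen for illustration, with `onError` standing in for whatever logger you use.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class BoundedDownloader
{
    // Runs downloadAsync once per URL with at most maxParallel downloads in flight.
    // A failed download is reported through onError and skipped; it does not
    // stop the other downloads.
    public static async Task<IReadOnlyList<string>> DownloadAllAsync(
        IEnumerable<string> urls,
        Func<string, CancellationToken, Task<string>> downloadAsync,
        int maxParallel,
        Action<string, Exception> onError,
        CancellationToken ct = default)
    {
        using var gate = new SemaphoreSlim(maxParallel);

        var tasks = urls.Select(async url =>
        {
            await gate.WaitAsync(ct); // wait for a free slot
            try
            {
                return await downloadAsync(url, ct);
            }
            catch (Exception ex) when (!(ex is OperationCanceledException))
            {
                onError(url, ex); // e.g. write to the log
                return null;
            }
            finally
            {
                gate.Release();
            }
        }).ToList(); // ToList starts all tasks; the semaphore limits concurrency

        string[] results = await Task.WhenAll(tasks);
        return results.Where(r => r != null).ToList();
    }
}
```

In a real downloader, `downloadAsync` would typically wrap an `HttpClient` call and `onError` would call `ILogger.LogError`; both are left to the caller here so the sketch stays independent of any particular HTTP or logging setup.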

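There is actually no official documentation on how to write a crawler in .NET Core, since this is more of a personal lab project. I am sharing my own crawler framework here so that you can get some ideas on how to write a distributed crawler (with a separate downloader, storage service, and so on). It is of course not perfect. The framework is a bit more complex than your crawler (I basically copied the ideas of Scrapy), and it is written on .NET Core 2.0, so it cannot take advantage of the latest features of .NET Core 3.1. But I believe you can still benefit from it.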
