Best storage in Azure for very large data sets with fast update options

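We have about 300 million documents on a drive, and we need to delete about 200 million of them. I plan to write the 200 million paths to some storage so that I can track which documents have been deleted. My current thinking is that Azure SQL Database is not a good fit for this volume, and Cosmos DB is too expensive. Storing CSV files is a poor option because I would need to update the file every time I delete a document. Table storage seems like a good match, but it does not offer grouping operations, which would come in handy for status reports. I don't know much about Data Lake: does it support fast updates, or is it more of an archive? Any suggestions for choosing the right storage for this kind of reporting are welcome.

Thanks in advance.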

Depending on your needs, you can use either Azure Cosmos DB or Azure Table storage.

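Azure Table storage provides a NoSQL key-value store for semi-structured data. Unlike a traditional relational database, each entity (the equivalent of a row in relational terms) can have a different structure, which lets your application migrate between schemas without downtime.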
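To make the key-value model concrete for the path-tracking case in the question, here is a minimal sketch of how a file path might be turned into a Table storage entity. The helper names (`path_to_entity`, `status_report`), the `Path` and `Status` property names, and the choice of 256 partition buckets are all assumptions for illustration, not anything prescribed by Azure. The path is hashed because Table storage keys cannot contain characters such as `/`, `\`, `#`, or `?`. Since Table storage has no server-side grouping, the status tally is done on the client.

```python
import hashlib
from collections import Counter


def path_to_entity(path: str, status: str = "pending", buckets: int = 256) -> dict:
    """Build an entity dict for one file path (hypothetical layout).

    The SHA-256 hex digest is used as the RowKey because raw paths contain
    '/' or '\\', which are not allowed in Table storage keys. The original
    path is kept as an ordinary property.
    """
    digest = hashlib.sha256(path.encode("utf-8")).hexdigest()
    # Spread entities across a fixed number of partitions (assumed: 256).
    bucket = int(digest[:8], 16) % buckets
    return {
        "PartitionKey": f"{bucket:04d}",
        "RowKey": digest,
        "Path": path,
        "Status": status,
    }


def status_report(entities) -> Counter:
    """Client-side grouping: count entities per Status value."""
    return Counter(e["Status"] for e in entities)


entities = [
    path_to_entity("/mnt/share/reports/2019/q3.pdf", status="deleted"),
    path_to_entity("/mnt/share/reports/2019/q4.pdf", status="deleted"),
    path_to_entity("/mnt/share/reports/2020/q1.pdf"),
]
print(status_report(entities))
```

For 200 million rows a full client-side scan would be slow, so in practice one might keep running counters per partition instead; that is a design choice outside what this sketch covers.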

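Azure Cosmos DB is a multi-model database service designed for mission-critical systems at global scale. It exposes not only a Table API but also SQL, Apache Cassandra, MongoDB, and Gremlin APIs. These make it easy to replace an existing database with a Cosmos DB implementation.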

The two differ as follows:

Performance

Azure Table Storage has no upper bound on latency. Cosmos DB defines single-digit-millisecond latency for reads and writes, with operations completing in under 15 milliseconds at the 99th percentile worldwide. (That was a mouthful.) Throughput on Table Storage is limited to 20,000 operations per second. On Cosmos DB there is no upper limit on throughput, and more than 10 million operations per second are supported.
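To put the Table Storage throughput limit in the context of the question, here is a rough back-of-the-envelope calculation. It assumes one write operation per path and that the full 20,000 operations per second quoted above is sustained, which is optimistic; the function name `min_hours` is just for illustration.

```python
def min_hours(total_ops: int, ops_per_second: int) -> float:
    """Lower bound on wall-clock hours to issue total_ops at a fixed rate."""
    return total_ops / ops_per_second / 3600


TOTAL_PATHS = 200_000_000      # paths to record, from the question
TABLE_STORAGE_LIMIT = 20_000   # operations per second, from the figure above

# One write per path: 200,000,000 / 20,000 = 10,000 seconds.
print(f"{min_hours(TOTAL_PATHS, TABLE_STORAGE_LIMIT):.1f} hours")  # 2.8 hours
```

So even at the limit, recording all 200 million paths takes a few hours at minimum, and a later status update per path would roughly double that.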

Global Distribution

Azure Table Storage supports a single region with an optional read-only secondary region for availability. Cosmos DB supports distribution from 1 to more than 30 regions with automatic failovers worldwide.

Billing

Azure Table Storage bills by storage volume. Pricing is tiered, so the cost per GB gets progressively cheaper the more storage you use. Operations incur a charge measured per 10,000 transactions.

Cosmos DB has two billing components: provisioned throughput and consumed storage.

  • Provisioned Throughput: Provisioned throughput (also called reserved throughput) guarantees high performance at any scale. You specify the throughput (RU/s) that you need, and Azure Cosmos DB dedicates the resources required to guarantee the configured throughput. You are billed hourly for the maximum provisioned throughput for a given hour.
  • Consumed Storage: You are billed a flat rate for the total amount of storage (GBs) consumed for data and the indexes for a given hour.
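The two billing models can be compared with a small estimator like the one below. This is only a sketch: every price in it is a placeholder and not a real Azure rate, the Table Storage tiering is flattened to a single per-GB price, and the workload figures (50 GB, 400 million transactions, 10,000 RU/s) are assumptions made up for the example. Check the current pricing pages before drawing conclusions.

```python
HOURS_PER_MONTH = 730  # common approximation for a billing month


def table_storage_monthly_cost(stored_gb, transactions, price_per_gb, price_per_10k_tx):
    """Storage volume plus a per-10,000-transaction charge (tiering ignored)."""
    return stored_gb * price_per_gb + (transactions / 10_000) * price_per_10k_tx


def cosmos_monthly_cost(provisioned_ru_s, stored_gb, price_per_100_ru_hour, price_per_gb):
    """Hourly charge for provisioned RU/s plus a charge for consumed storage."""
    throughput = (provisioned_ru_s / 100) * price_per_100_ru_hour * HOURS_PER_MONTH
    return throughput + stored_gb * price_per_gb


# Placeholder prices and workload: NOT real Azure rates.
table_cost = table_storage_monthly_cost(
    stored_gb=50, transactions=400_000_000,
    price_per_gb=0.05, price_per_10k_tx=0.0004,
)
cosmos_cost = cosmos_monthly_cost(
    provisioned_ru_s=10_000, stored_gb=50,
    price_per_100_ru_hour=0.008, price_per_gb=0.25,
)
print(f"Table storage: {table_cost:.2f}  Cosmos DB: {cosmos_cost:.2f}")
```

The point of the sketch is the shape of the two formulas: Table Storage cost scales with data volume and operation count, while Cosmos DB cost is dominated by the throughput you provision, whether or not you use it.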

For details, please refer to the documentation.