Does STARTSWITH on Cosmos partition keys optimize "fan-out" of cross-partition queries?
Microsoft states explicitly that a cross-partition query "fans out" to every partition (link):
The following query doesn't have a filter on the partition key (DeviceId). Therefore, it must fan-out to all physical partitions where it is run against each partition's index:
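So I was curious whether the "fan-out" could be optimized by running a range query on the partition key, such as STARTSWITH.
To test this, I created a small Cosmos DB container holding seven documents: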
{
"partitionKey": "prefix1:",
"id": "item1a"
},
{
"partitionKey": "prefix1:",
"id": "item1b"
},
{
"partitionKey": "prefix1:",
"id": "item1c"
},
{
"partitionKey": "prefix1X:",
"id": "item1d"
},
{
"partitionKey": "prefix2:",
"id": "item2a"
},
{
"partitionKey": "prefix2:",
"id": "item2b"
},
{
"partitionKey": "prefix3:",
"id": "item3a"
}
It uses the default indexing policy, with "/partitionKey" as the partition key. I then ran a batch of queries:
SELECT * FROM c WHERE STARTSWITH(c.partitionKey, 'prefix1')
-- Actual Request Charge: 2.92 RUs
SELECT * FROM c WHERE c.partitionKey = 'prefix1:' OR c.partitionKey = 'prefix1X:'
-- Actual Request Charge: 3.02 RUs
SELECT * FROM c WHERE STARTSWITH(c.partitionKey, 'prefix1:')
SELECT * FROM c WHERE c.partitionKey = 'prefix1:'
-- Each Query Has Actual Request Charge: 2.89 RUs
SELECT * FROM c WHERE STARTSWITH(c.partitionKey, 'prefix2')
SELECT * FROM c WHERE c.partitionKey = 'prefix2:'
-- Each Query Has Actual Request Charge: 2.86 RUs
SELECT * FROM c WHERE STARTSWITH(c.partitionKey, 'prefix3')
SELECT * FROM c WHERE c.partitionKey = 'prefix3:'
-- Each Query Has Actual Request Charge: 2.83 RUs
SELECT * FROM c WHERE c.partitionKey = 'prefix2:' OR c.partitionKey = 'prefix3:'
-- Actual Request Charge: 2.99 RUs
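The request charges were consistent when I re-ran the queries. The way the charges grow seems consistent with result-set size and query complexity, with the possible exception of the 'OR' queries. But then I tried this: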
SELECT * FROM c
-- Actual Request Charge: 2.35 RUs
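The basic fan-out across all partitions is even cheaper in request units than targeting a specific partition, even with the equality operator. I don't understand how that can be.
All that said, my sample database is tiny, with only seven documents. The data set is probably too small for the results to be trusted.
So, if I had millions of documents, would STARTSWITH(c.partitionKey, 'prefix') be more optimized than fanning out to all partitions?
The docs suggest there is some efficiency gain: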
With Azure Cosmos DB, typically queries perform in the following order from fastest/most efficient to slower/less efficient.
- GET on a single partition key and item key
- Query with a filter clause on a single partition key
- Query without an equality or range filter clause on any property
- Query without filters
As you scale out, there will be fewer and fewer "logical partitions" per "physical partition", until eventually each partition key value has its own physical partition.
So:
if I had millions of documents, would STARTSWITH(c.partitionKey, 'prefix') be more optimized than fanning out to all partitions?
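Both queries will fan out across multiple partitions.
And I'm fairly sure that, because "Azure Cosmos DB uses hash-based partitioning to spread logical partitions across physical partitions", there is no locality between partition keys that share a common prefix, so every STARTSWITH query has to fan out across all physical partitions.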
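To make the "no locality" argument concrete, here is a minimal toy model in Python. It is only a sketch under stated assumptions: MD5 and a simple modulo stand in for whatever hash function and range assignment Cosmos DB uses internally, and `physical_partition`, `partitions_for_prefix`, and the partition count of 10 are hypothetical names and values chosen for illustration.

```python
import hashlib


def physical_partition(partition_key: str, partition_count: int) -> int:
    """Map a logical partition key to a physical partition index by hashing.

    Assumption: MD5 plus modulo is a stand-in, not the real Cosmos DB scheme.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


def partitions_for_prefix(keys, prefix, partition_count):
    """Return the set of physical partitions holding keys that start with prefix."""
    return {
        physical_partition(key, partition_count)
        for key in keys
        if key.startswith(prefix)
    }


# Hypothetical data set: 1,000 distinct key values under each of two prefixes.
keys = [f"prefix1:{i}" for i in range(1000)] + [f"prefix2:{i}" for i in range(1000)]

hit = partitions_for_prefix(keys, "prefix1:", partition_count=10)
print(f"'prefix1:' keys live on {len(hit)} of 10 partitions: {sorted(hit)}")
```

In this toy model the hash scatters keys that share a prefix, so a prefix filter cannot narrow down which physical partitions to visit. A single exact key value, by contrast, always maps to exactly one partition.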
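I was trying to determine for myself whether this approach had any benefit, but according to the answer it seems not.
I just learned about the new hierarchical partition keys feature in private preview, which seems to address the problem we've been struggling with: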
https://devblogs.microsoft.com/cosmosdb/hierarchical-partition-keys-private-preview/
Hierarchical partition keys are now available in private preview for
the Azure Cosmos DB Core (SQL) API. With hierarchical partition keys,
also known as sub-partitioning, you can now natively partition your
container with up to three levels of partition keys. This enables more
optimal partitioning strategies for multi-tenant scenarios or
workloads that would otherwise use synthetic partition keys. Instead
of having to choose a single partition key – which often leads to
performance trade-offs – you can now use up to three keys to further
sub-partition your data, enabling more optimal data distribution and
higher scale.
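Because this allows up to three keys, it could solve the problem by breaking the prefix into separate keys or, if there are more than three components, at least optimize it further.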
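As a rough illustration of "breaking the prefix into separate keys", the sketch below splits a synthetic key into at most three levels. Everything here is an assumption for illustration: the `:` separator, the `split_synthetic_key` helper, and the choice to fold any extra components into the last level are not part of the Cosmos DB SDK.

```python
from typing import List


def split_synthetic_key(
    synthetic_key: str, separator: str = ":", max_levels: int = 3
) -> List[str]:
    """Split a synthetic partition key into at most max_levels components.

    Hypothetical helper: if there are more components than levels, the
    remainder is folded back into the last level so no information is lost.
    """
    parts = [part for part in synthetic_key.split(separator) if part]
    if len(parts) <= max_levels:
        return parts
    head = parts[: max_levels - 1]
    tail = separator.join(parts[max_levels - 1:])
    return head + [tail]


print(split_synthetic_key("Contoso:Alice"))
print(split_synthetic_key("Contoso:Alice:Session42:Event7"))
```

Each resulting component would then become one level of the hierarchical partition key, so a query that filters on the first level takes the place of a STARTSWITH on the combined string.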
Example
(usage example from the link):
https://github.com/AzureCosmosDB/HierarchicalPartitionKeysFeedbackGroup#net-v3-sdk-2
// Get the full partition key path
var id = "0a70accf-ec5d-4c2b-99a7-af6e2ea33d3d";
var fullPartitionkeyPath = new PartitionKeyBuilder()
.Add("Contoso") //TenantId
.Add("Alice") //UserId
.Build();
var itemResponse = await containerSubpartitionByTenantId_UserId.ReadItemAsync<dynamic>(id, fullPartitionkeyPath);
Caveats
According to the preview link, it appears you need to opt in to the preview and create a new container:
New containers only – all keys must be specified upon container
creation