Clustering: does sorting data before ingestion improve the performance of truncating my table?

Question

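My goal is to ingest data sorted on a specific column, so that the partitions are laid out in that order too and pruning on that column is efficient.

I want to minimize the cost of sorting, and I would like some guidance on how often I should recluster.

For example: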

CREATE TABLE test_order(n NUMBER, s STRING);
INSERT INTO test_order 
VALUES 
   (12, 'a'), 
   (11, 'b'), 
   (10, 'c'), 
   (9, 'd'), 
   (8, 'e'), 
   (7, 'f'), 
   (6, 'g'), 
   (5, 'h'), 
   (6, 'i'), 
   (5, 'j'), 
   (4, 'k'), 
   (3, 'l'), 
   (2, 'm'), 
   (1, 'n');

SELECT * FROM test_order 
ORDER BY n ASC;

ALTER TABLE test_order CLUSTER BY (n, s);
ALTER TABLE test_order RECLUSTER;

SELECT n, s FROM test_order;
SELECT SYSTEM$CLUSTERING_INFORMATION('test_order', '(n,s)');

This was the clustering information after the first insert:

{
  "cluster_by_keys" : "LINEAR(N, S)",
  "total_partition_count" : 1,
  "total_constant_partition_count" : 0,
  "average_overlaps" : 0.0,
  "average_depth" : 1.0,
  "partition_depth_histogram" : {
    "00000" : 0,
    "00001" : 1,
    "00002" : 0,
    "00003" : 0,
    "00004" : 0,
    "00005" : 0,
    "00006" : 0,
    "00007" : 0,
    "00008" : 0,
    "00009" : 0,
    "00010" : 0,
    "00011" : 0,
    "00012" : 0,
    "00013" : 0,
    "00014" : 0,
    "00015" : 0,
    "00016" : 0
  }
}

This was the second insert, followed by the clustering information after it:

INSERT INTO test_order 
VALUES 
   (12, 'p'), 
   (11, 'f'), 
   (10, 'z'), 
   (9, 'y'), 
   (8, 'x'), 
   (7, 'w'), 
   (6, 'v'), 
   (5, 'u'), 
   (6, 't'), 
   (5, 's'), 
   (4, 'r'), 
   (3, 'q'), 
   (2, 'p'), 
   (1, 'o');

{
  "cluster_by_keys" : "LINEAR(N, S)",
  "total_partition_count" : 2,
  "total_constant_partition_count" : 0,
  "average_overlaps" : 1.0,
  "average_depth" : 2.0,
  "partition_depth_histogram" : {
    "00000" : 0,
    "00001" : 0,
    "00002" : 2,
    "00003" : 0,
    "00004" : 0,
    "00005" : 0,
    "00006" : 0,
    "00007" : 0,
    "00008" : 0,
    "00009" : 0,
    "00010" : 0,
    "00011" : 0,
    "00012" : 0,
    "00013" : 0,
    "00014" : 0,
    "00015" : 0,
    "00016" : 0
  }
}

And then after the second recluster:

{
  "cluster_by_keys" : "LINEAR(N, S)",
  "total_partition_count" : 2,
  "total_constant_partition_count" : 0,
  "average_overlaps" : 1.0,
  "average_depth" : 2.0,
  "partition_depth_histogram" : {
    "00000" : 0,
    "00001" : 0,
    "00002" : 2,
    "00003" : 0,
    "00004" : 0,
    "00005" : 0,
    "00006" : 0,
    "00007" : 0,
    "00008" : 0,
    "00009" : 0,
    "00010" : 0,
    "00011" : 0,
    "00012" : 0,
    "00013" : 0,
    "00014" : 0,
    "00015" : 0,
    "00016" : 0
  }
}

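Sorry, I'm new to formatting. After inserting in a specific order, the clustering ratio did not change much. Is that because my example data set is too small, or does insert order not matter for clustering performance?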

Answer 1

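If you are ingesting sorted data, I don't think you need to cluster the table. Your data will be naturally clustered, and you will get the pruning you want.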
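
A minimal sketch of what that could look like, assuming the raw rows first land in a staging table. The name test_order_staging is hypothetical and not from the question; the two SYSTEM$ functions accept a column list, so they can be used to inspect the table without defining a clustering key. With only a handful of rows, each insert most likely ends up in a single small micro-partition, so the numbers only become meaningful on larger loads.

```sql
-- Hypothetical staging table holding the raw, unsorted rows (assumption).
CREATE OR REPLACE TABLE test_order_staging (n NUMBER, s STRING);

INSERT INTO test_order_staging
VALUES
   (12, 'a'),
   (3, 'l'),
   (7, 'f'),
   (1, 'n');

-- Load in sorted order so that rows with nearby values of n
-- are written into the same micro-partitions.
INSERT INTO test_order (n, s)
SELECT n, s
FROM test_order_staging
ORDER BY n;

-- Check how well the table is clustered on n.
SELECT SYSTEM$CLUSTERING_INFORMATION('test_order', '(n)');
SELECT SYSTEM$CLUSTERING_DEPTH('test_order', '(n)');
```

If the depth reported for n stays low as more sorted batches arrive, reclustering is probably not worth its cost; if it keeps growing, that may be the point at which reclustering starts to pay off.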

snowflake-cloud-data-platform