Using partition key along with secondary index
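Following are the two queries that I need to perform.
select * from <table> where dept = 100 and emp_id = 1;
select * from <table> where dept = 100 and name = 'One';
Which of the options below is better?
Option 1: Use a secondary index along with the partition key. I assume the query will execute faster this way, since there is no need to go to different nodes and the index only has to be searched locally.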
cqlsh:d2> desc table emp_by_dept;
CREATE TABLE d2.emp_by_dept (
dept int,
emp_id int,
name text,
PRIMARY KEY (dept, emp_id)
) WITH CLUSTERING ORDER BY (emp_id ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'min_threshold': '4', 'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
CREATE INDEX emp_by_dept_name_idx ON d2.emp_by_dept (name);
cqlsh:d2> select * from emp_by_dept where dept = 100;
dept | emp_id | name
------+--------+------
100 | 1 | One
100 | 2 | Two
100 | 10 | Ten
(3 rows)
activity | timestamp | source | source_elapsed
-------------------------------------------------------------------------------------------------+----------------------------+-----------+----------------
Execute CQL3 query | 2015-06-15 17:36:55.860000 | 10.0.2.16 | 0
Parsing select * from emp_by_dept where dept = 100; [SharedPool-Worker-1] | 2015-06-15 17:36:55.861000 | 10.0.2.16 | 202
Preparing statement [SharedPool-Worker-1] | 2015-06-15 17:36:55.861000 | 10.0.2.16 | 418
Executing single-partition query on emp_by_dept [SharedPool-Worker-3] | 2015-06-15 17:36:55.871000 | 10.0.2.16 | 10525
Acquiring sstable references [SharedPool-Worker-3] | 2015-06-15 17:36:55.871000 | 10.0.2.16 | 10564
Merging memtable tombstones [SharedPool-Worker-3] | 2015-06-15 17:36:55.871000 | 10.0.2.16 | 10635
Key cache hit for sstable 1 [SharedPool-Worker-3] | 2015-06-15 17:36:55.871000 | 10.0.2.16 | 10748
Seeking to partition beginning in data file [SharedPool-Worker-3] | 2015-06-15 17:36:55.871000 | 10.0.2.16 | 10757
Skipped 0/1 non-slice-intersecting sstables, included 0 due to tombstones [SharedPool-Worker-3] | 2015-06-15 17:36:55.879000 | 10.0.2.16 | 18141
Merging data from memtables and 1 sstables [SharedPool-Worker-3] | 2015-06-15 17:36:55.879000 | 10.0.2.16 | 18166
Read 3 live and 0 tombstoned cells [SharedPool-Worker-3] | 2015-06-15 17:36:55.879000 | 10.0.2.16 | 18335
Request complete | 2015-06-15 17:36:55.928174 | 10.0.2.16 | 68174
cqlsh:d2> select * from emp_by_dept where dept = 100 and name = 'One';
dept | emp_id | name
------+--------+------
100 | 1 | One
(1 rows)
Tracing session: c56e70a0-1357-11e5-ab8b-fb5400f1b4af
activity | timestamp | source | source_elapsed
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------+-----------+----------------
Execute CQL3 query | 2015-06-15 17:42:20.010000 | 10.0.2.16 | 0
Parsing select * from emp_by_dept where dept = 100 and name = 'One'; [SharedPool-Worker-1] | 2015-06-15 17:42:20.010000 | 10.0.2.16 | 12
Preparing statement [SharedPool-Worker-1] | 2015-06-15 17:42:20.010000 | 10.0.2.16 | 19
Computing ranges to query [SharedPool-Worker-1] | 2015-06-15 17:42:20.011000 | 10.0.2.16 | 881
Candidate index mean cardinalities are CompositesIndexOnRegular{columnDefs=[ColumnDefinition{name=name, type=org.apache.cassandra.db.marshal.UTF8Type, kind=REGULAR, componentIndex=1, indexName=emp_by_dept_name_idx, indexType=COMPOSITES}]}:1. Scanning with emp_by_dept.emp_by_dept_name_idx. [SharedPool-Worker-1] | 2015-06-15 17:42:20.011000 | 10.0.2.16 | 1144
Submitting range requests on 1 ranges with a concurrency of 1 (0.003515625 rows per range expected) [SharedPool-Worker-1] | 2015-06-15 17:42:20.011000 | 10.0.2.16 | 1238
Executing indexed scan for [100, 100] [SharedPool-Worker-2] | 2015-06-15 17:42:20.011000 | 10.0.2.16 | 1703
Candidate index mean cardinalities are CompositesIndexOnRegular{columnDefs=[ColumnDefinition{name=name, type=org.apache.cassandra.db.marshal.UTF8Type, kind=REGULAR, componentIndex=1, indexName=emp_by_dept_name_idx, indexType=COMPOSITES}]}:1. Scanning with emp_by_dept.emp_by_dept_name_idx. [SharedPool-Worker-2] | 2015-06-15 17:42:20.012000 | 10.0.2.16 | 1827
Candidate index mean cardinalities are CompositesIndexOnRegular{columnDefs=[ColumnDefinition{name=name, type=org.apache.cassandra.db.marshal.UTF8Type, kind=REGULAR, componentIndex=1, indexName=emp_by_dept_name_idx, indexType=COMPOSITES}]}:1. Scanning with emp_by_dept.emp_by_dept_name_idx. [SharedPool-Worker-2] | 2015-06-15 17:42:20.012000 | 10.0.2.16 | 1929
Executing single-partition query on emp_by_dept.emp_by_dept_name_idx [SharedPool-Worker-2] | 2015-06-15 17:42:20.012000 | 10.0.2.16 | 2058
Acquiring sstable references [SharedPool-Worker-2] | 2015-06-15 17:42:20.012000 | 10.0.2.16 | 2087
Merging memtable tombstones [SharedPool-Worker-2] | 2015-06-15 17:42:20.012000 | 10.0.2.16 | 2173
Key cache hit for sstable 1 [SharedPool-Worker-2] | 2015-06-15 17:42:20.012000 | 10.0.2.16 | 2352
Seeking to partition indexed section in data file [SharedPool-Worker-2] | 2015-06-15 17:42:20.012001 | 10.0.2.16 | 2377
Skipped 0/1 non-slice-intersecting sstables, included 0 due to tombstones [SharedPool-Worker-2] | 2015-06-15 17:42:20.014000 | 10.0.2.16 | 4300
Merging data from memtables and 1 sstables [SharedPool-Worker-2] | 2015-06-15 17:42:20.014000 | 10.0.2.16 | 4322
Submitted 1 concurrent range requests covering 1 ranges [SharedPool-Worker-1] | 2015-06-15 17:42:20.031000 | 10.0.2.16 | 21798
Read 1 live and 0 tombstoned cells [SharedPool-Worker-2] | 2015-06-15 17:42:20.032000 | 10.0.2.16 | 21989
Executing single-partition query on emp_by_dept [SharedPool-Worker-2] | 2015-06-15 17:42:20.032000 | 10.0.2.16 | 22374
Acquiring sstable references [SharedPool-Worker-2] | 2015-06-15 17:42:20.032000 | 10.0.2.16 | 22385
Merging memtable tombstones [SharedPool-Worker-2] | 2015-06-15 17:42:20.032000 | 10.0.2.16 | 22433
Key cache hit for sstable 1 [SharedPool-Worker-2] | 2015-06-15 17:42:20.032000 | 10.0.2.16 | 22514
Seeking to partition indexed section in data file [SharedPool-Worker-2] | 2015-06-15 17:42:20.032000 | 10.0.2.16 | 22523
Skipped 0/1 non-slice-intersecting sstables, included 0 due to tombstones [SharedPool-Worker-2] | 2015-06-15 17:42:20.033000 | 10.0.2.16 | 22963
Merging data from memtables and 1 sstables [SharedPool-Worker-2] | 2015-06-15 17:42:20.033000 | 10.0.2.16 | 22972
Read 1 live and 0 tombstoned cells [SharedPool-Worker-2] | 2015-06-15 17:42:20.033000 | 10.0.2.16 | 22991
Scanned 1 rows and matched 1 [SharedPool-Worker-2] | 2015-06-15 17:42:20.033000 | 10.0.2.16 | 23096
Request complete | 2015-06-15 17:42:20.033227 | 10.0.2.16 | 23227
Option 2: Create two tables, as shown below.
CREATE TABLE d2.emp_by_dept (
dept int,
emp_id int,
name text,
PRIMARY KEY (dept, emp_id)
) WITH CLUSTERING ORDER BY (emp_id ASC);
select * from emp_by_dept where dept = 100 and emp_id = 1;
CREATE TABLE d2.emp_by_dept_name (
dept int,
emp_id int,
name text,
PRIMARY KEY (dept, name)
) WITH CLUSTERING ORDER BY (name ASC);
select * from emp_by_dept_name where dept = 100 and name = 'One';
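Option 1 is not possible, because Cassandra does not support queries that use both the primary key and a secondary key. Your best bet is to go with Option 2.
Although there are many similarities, do not think of it as a 'relational table'. Instead, think of it as a nested, sorted map data structure.
Cassandra favors denormalization and duplication of data for better read performance. So Option 2 is perfectly normal and well within Cassandra best practices.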
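To make the nested, sorted map picture concrete: in emp_by_dept the partition key dept plays the role of the outer map key, and the clustering column emp_id is the sorted key of the inner map. A minimal sketch, assuming the three sample rows shown in the question are loaded, is a slice over that inner map:

```cql
-- Outer key: dept = 100 selects a single partition.
-- Inner sorted key: emp_id, so a contiguous range is read in clustering order.
SELECT * FROM d2.emp_by_dept
WHERE dept = 100 AND emp_id >= 2 AND emp_id <= 10;
```

With that sample data this should return the rows for emp_id 2 and 10, in that order. A comparable slice by name is what clustering emp_by_dept_name by name is meant to serve.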
A link you may find useful: http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/
Hope it helps.
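Since maintaining two tables is harder than maintaining one, the first option is preferable.
Query 1 = select * from <> where dept = 100 and emp_id = 1;
Query 2 = select * from <> where dept = 100 and name = 'One';
Option 1:
Write: time to write to emp_by_dept + time to update the index
Read: Query 1 reads directly from emp_by_dept; Query 2 gets the location from the index table and then reads the value from the base table
Option 2:
Write: time to write to emp_by_dept + time to write to emp_by_dept_name
Read: Query 1 reads directly from emp_by_dept; Query 2 reads directly from emp_by_dept_name (the required data is already stored in sorted order)
So I assume the write time should be almost the same in both cases (I have not tested this).
If read response time matters more to you, go with Option 2.
If you are concerned about maintaining two tables, go with Option 1.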
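The Option 2 write path described above (one write per table) could be sketched as a logged batch, so that both inserts are eventually applied together. This is only an illustration with a made-up employee row (emp_id 3, 'Three') that does not appear in the original posts. A logged batch adds its own write overhead, so it is worth measuring.

```cql
-- Hypothetical row; both tables must receive the same data,
-- because Cassandra does not keep the two tables in sync for you.
BEGIN BATCH
  INSERT INTO d2.emp_by_dept (dept, emp_id, name) VALUES (100, 3, 'Three');
  INSERT INTO d2.emp_by_dept_name (dept, emp_id, name) VALUES (100, 3, 'Three');
APPLY BATCH;
```

Under Option 1 only the first INSERT would be needed, since Cassandra updates the index on name as part of the write.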
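Thanks, everyone, for your input.
In general, using a secondary index together with the partition key is a good approach because, as you said, the secondary index lookup can be performed on a single machine.
Another concept to take into account is the cardinality of the secondary index. In your case emp_id is probably unique and name is nearly unique, so the index will most likely return a single row, which makes it inefficient. For a better explanation, I recommend this article: http://www.wentnet.com/blog/?p=77
So if query time is critical and you are able to update both tables together, I suggest you go with Option 2.
It would also be interesting to measure both options with some generated data.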
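One low-effort way to make that comparison, assuming both schemas have been created in keyspace d2 and loaded with the same rows, is to repeat the tracing approach from the question in cqlsh and compare the elapsed times of the two read paths:

```cql
-- cqlsh session; TRACING is a cqlsh command, not part of CQL itself.
TRACING ON;

-- Option 1: partition key plus the secondary index on name.
SELECT * FROM d2.emp_by_dept WHERE dept = 100 AND name = 'One';

-- Option 2: dedicated table clustered by name.
SELECT * FROM d2.emp_by_dept_name WHERE dept = 100 AND name = 'One';

TRACING OFF;
```

A single trace on a three-row table says little. The numbers only become meaningful with a realistic data volume and repeated runs.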