Is a counter a better choice for uniqueness?

I currently have the following basic table layout for user events:

CREATE TABLE IF NOT EXISTS events.events_by_user(
    user text,
    added_week int,
    added_timestamp timestamp,
    event text,
    uuid uuid,
    PRIMARY KEY((user, added_week), added_timestamp, event, uuid))
WITH CLUSTERING ORDER BY (added_timestamp DESC);

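So the uuid, as the last column of the primary key, basically guarantees uniqueness. Multiple identical events for the same user can occur within the same millisecond (the resolution of the timestamp).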
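To make the collision case concrete, here is a minimal sketch against the table above. The user name, week number and timestamp are made-up values, and uuid() is the built-in CQL function that generates a random UUID.

```cql
-- Two identical events for the same user in the same millisecond.
-- Only the generated uuid differs, so both rows are kept.
INSERT INTO events.events_by_user (user, added_week, added_timestamp, event, uuid)
VALUES ('alice', 42, '2015-06-01 12:00:00.000+0000', 'login', uuid());

INSERT INTO events.events_by_user (user, added_week, added_timestamp, event, uuid)
VALUES ('alice', 42, '2015-06-01 12:00:00.000+0000', 'login', uuid());
```

Without the uuid column in the primary key, the second INSERT would address the same row as the first and silently overwrite it, because writes in Cassandra are upserts.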

Another approach might be (if I'm not mistaken) to drop the uuid column and replace it with a counter column, like this:

CREATE TABLE IF NOT EXISTS events.events_by_user(
    user text,
    added_week int,
    added_timestamp timestamp,
    event text,
    frequency counter,
    PRIMARY KEY((user, added_week), added_timestamp, event))
WITH CLUSTERING ORDER BY (added_timestamp DESC);

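My thinking is that by using this counter I can save some space, and my rows won't get much wider either. I'm not sure whether maintaining this counter has other performance implications, or whether there is any other reason why this might not be a good idea?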

Why do you want to use counters to save space? The C* design idiom is to spend space in order to gain efficiency.

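Back to your question: counters come with significant restrictions on what you can do. For example, they must live in their own tables: you can have as many primary key columns as you like, but every other column must be a counter. They support only increment and decrement operations, and because those are relative updates, counter writes are not idempotent: a write that is retried after a timeout may be applied twice. So you have to be able to live with some inaccuracy in the counted value (overcounting is a well-known problem, even if C* 2.1+ mitigates it somewhat).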

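This means a non-counter column such as event is only allowed if it is part of the primary key. In the schema as posted, event is a clustering column, so that particular rule is satisfied; with event as a regular column the table definition would be rejected.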
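As a rough illustration of these restrictions, here is a sketch of a counter table and the only kind of write it accepts. The table name events.event_counts and the literal values are hypothetical and not taken from the question.

```cql
-- Hypothetical counter table: every non-counter column is part of the primary key.
CREATE TABLE IF NOT EXISTS events.event_counts (
    user text,
    added_week int,
    event text,
    frequency counter,
    PRIMARY KEY ((user, added_week), event)
);

-- Counter columns cannot be written with INSERT; they are only changed
-- through relative updates like this one.
UPDATE events.event_counts
   SET frequency = frequency + 1
 WHERE user = 'alice' AND added_week = 42 AND event = 'login';
```

If the client times out on this UPDATE and retries it, the increment may end up applied twice, which is where the overcounting comes from.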

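Back to your uniqueness requirement: you can use the timeuuid column type. These are time-based Type 1 UUIDs and offer a fairly low probability of collision. From the Cassandra wiki: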

A Type 1 UUID consists of the following:

  • A timestamp consisting of a count of 100-nanosecond intervals since 00:00:00.00, 15 October 1582 (the date of Gregorian reform to the Christian calendar).

  • A version (which should have a value of 1).

  • A variant (which should have a value of 2).

  • A sequence number, which can be a counter or a pseudo-random number.

  • A "node" which will be the machines MAC address (which should make the UUID unique across machines).

The challenge with a UUID is to make it be unique for multiple processes running on a single machine and multiple threads running in a single process. The Type 1 UUID as specified above does neither. On a fast machine with multiple cores it is quite possible to have a UUID generated with the same time value. This can be remedied only if the sequence number can span threads and processes, something that is quite challenging to do efficiently.

The Time Based UUID referenced compensates for these issues by:

  • Only using the normal millisecond granularity returned by System.currentTimeMillis() and adjusting it to pretend to contain 100 ns counts.

  • Incrementing the time by 1 (in a non-threadsafe manner) whenever a duplicate time value is encountered.

  • Using a pseudo-random number associated with the UUID Class for the sequence number.

Incrementing the time by 1 allows multiple threads to uniquely create up to 10,000 UUIDs in the same millisecond in the same process. Using a pseudo-random number for the sequence number provides a 1 in a 16,384 chance that each UUID Class will have a unique id.

These mechanisms provide a reasonable probability that the generated UUIDs will be unique. However, the issues to be aware of are:

  • The computer is capable of generating more than 10,000 UUIDs per microsecond.

  • Applications creating UUIDs on different threads could get duplicates since the time is not incremented in a thread-safe manner.

  • More than one instance of the Class is in the VM in different Class Loaders - this will be mitigated by each Class having its own sequence number.

  • There is no guarantee that two instances of a UUID in the same or different VMs will have a different sequence number - just a reasonable probability that they will.

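In practice, C* can already do what you want. However, if you are really worried about ending up with duplicates, you will need to do proper counting yourself, and I suggest implementing it at the application level.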
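A minimal sketch of the timeuuid suggestion, assuming the separate added_timestamp and uuid columns are merged into a single clustering column. The column name added_time and the sample values are my own assumptions; now() and toTimestamp() are built-in CQL functions (toTimestamp() needs Cassandra 2.2 or later; older versions offer dateOf() instead).

```cql
-- One timeuuid column carries both the event time and the uniqueness.
CREATE TABLE IF NOT EXISTS events.events_by_user (
    user text,
    added_week int,
    added_time timeuuid,
    event text,
    PRIMARY KEY ((user, added_week), added_time)
) WITH CLUSTERING ORDER BY (added_time DESC);

-- now() generates a new timeuuid on the coordinator node.
INSERT INTO events.events_by_user (user, added_week, added_time, event)
VALUES ('alice', 42, now(), 'login');

-- The embedded timestamp can be extracted again when reading.
SELECT toTimestamp(added_time) AS added_timestamp, event
  FROM events.events_by_user
 WHERE user = 'alice' AND added_week = 42;
```

Because the timeuuid is already unique with reasonable probability, event no longer has to be part of the primary key in this variant. If the event time must come from the application rather than the coordinator, the timeuuid would have to be generated client-side instead of with now().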