如何在 Cassandra 中实现跨多个表的插入，唯一且原子地？

Question

我有一个数据模型，其中域对象有两个字段，这两个字段必须都是唯一的，并且对象必须通过这两个字段独立地获取table。其中一个是随机生成的，因此我们可以假设没有可能发生碰撞。另一个是用户选择的。这是我想出的：

CREATE TABLE object_primary (
    generated_value text PRIMARY KEY,
    data blob
);

CREATE TABLE object_unique_index (
    user_value text PRIMARY KEY,
    generated_value text
);

这里我使用 object_unique_index 作为主 table 的索引和资源锁，其中资源是用户选择的全局唯一值。

第一想法：

插入object_unique_index必须使用IF NOT EXISTS。因此我不能使用批处理。
插入 object_primary 不会，因为生成器已经保证了唯一性。这让我可以使用自定义 TIMESTAMPs，避免在创建时回读。
如果第一次插入失败，我不应该继续第二次插入。
如果第二次插入失败，我应该回滚第一次插入。
系统不应该处于这样一种状态，其中一行存在于任何一列而没有另一列。
我愿意在回滚期间忽略错误，并将清理委托给外部（可能是手动）进程。

似乎很清楚如何进行，但我正在努力解释 非条件 更新的某些错误情况。所有现有的描述都假定您不关心最终结果是什么，稍后再尝试写入。

UnavailableException：没有足够的节点达到法定人数，但当它们重新联机时，保存的提示将重新运行写入。这是否意味着最终状态将是写入成功？如果是这样，什么读取一致性级别允许我看到它？如果不是，我怎么知道最终状态会是什么？

CassandraWriteTimeoutException: 有足够的节点达到法定人数，但有些节点没有及时回复。据我所知，这只是 UnavailableException 的一个更模糊的版本。在处理方式上有什么不同吗？

我的很多困惑来自相互矛盾的陈述 here:

The coordinator can force the results towards either the pre-update or post-update state.

[...]

the coordinator stores the update locally, and will re-send it to the failed replica when it recovers, thus forcing it to the post-update state that the client wanted originally

那么什么时候强制进入更新前的状态呢？我如何判断它是在 post-update 结束（所以我忽略它）还是在更新前（所以我回滚第一个插入）？

有没有办法解决这个问题而不要求所有插入都是有条件的，从而增加更多的性能损失并失去设置写入时间的能力？

Answer 1

关于 Cassandra error handling done right 的 DataStax 博客文章涵盖了这个问题中提出的大部分主题，我将在整个回答中引用该文章的部分内容。

系统不应处于这样一种状态，其中一行存在于任何一列而没有另一列。

使用 原子批处理 包含对两个 table 的写入。不要在批处理中使用 compare-and-set (CAS) 操作，如 IF NOT EXISTS。我稍后会介绍。

Cassandra 1.2 introduced atomic batches, which rely on a batch log to guarantee that all the mutations in the batch will eventually be applied. This atomicity is useful as it allows developers to execute several writes on different partitions without having to worry on which part will be applied and which won’t: either all or none of the mutations will be eventually written.

重要的是要注意，它并不能让您控制 恰好在 写入变得可见时，但是批处理的所有部分（最终）或 none 他们会（永远）。

user-selected 值必须是唯一的。（System-assigned 值假定已经始终是唯一的。）

轻量级事务（CAS 的另一个术语）是保证唯一性的唯一方法，如您所知。我建议创建一个 third table 专用于 条件写入 的目的，其定义与您在 object_unique_index 中定义的方式类似题;我称之为 unique_user_value_for_insert。通过为 插入路径 指定一个 table 只有这样，其他应用程序逻辑才会 "see" 由于竞争条件而导致的不一致状态，因为没有其他东西应该从这个 table;它的唯一用途是 IF NOT EXISTS 检查。（我假设批处理中的两个 table 都用于正常应用程序逻辑的 read 操作。）

INSERT user_value, generated_value INTO unique_user_value_for_insert IF NOT EXISTS;

如果在 [applied]=false 处插入 returns 结果集，则 user-provided 名称 不唯一 并且您不应该尝试批量插入。如果结果集表示[applied]=true则执行批处理

有了这个 CAS 插入和上面的批处理，你正常的 "happy" 通过这个逻辑的路径应该被覆盖。我们仍然需要处理可能的异常路径。

UnavailableException

When a request reaches the coordinator and there is not enough replica alive to achieve the requested consistency level, the driver will throw an UnavailableException. If you look carefully at this exception, you’ll notice that it’s possible to get the amount of replicas that were known to be alive when the error was triggered, as well as the amount of replicas that where required by the requested consistency level.

我不能作为异常的权威发言，但这个描述听起来协调节点抛出这个异常在任何尝试执行操作之前。如果为真，则初始 CAS 插入失败不需要恢复操作，超出您的应用程序识别插入未成功的原因除了唯一性违规之外。原子批处理失败（在 CAS 插入之后）表明您需要 "undo" CAS 插入。

我会通过发送具有非常宽松的写一致性级别（如 CL.ANY）的 DELETE 来撤消操作，以确保删除操作最有可能 persisted/replayed 到可能已执行的任何可用副本写。如果失败，则说明您的集群不健康。

CassandraWriteTimeoutException

If a write timeouts at the coordinator level, there is no way to know whether the mutation has been applied or not on the non-answering replica. ... [T]he way this error will be handled will thus depends on whether the write operation was idempotent (which is the case of most statements in CQL) or not (for counter updates, and append/prepend updates on lists).

根据上述两个操作中哪一个失败以及异常中显示的信息，处理此问题的方式会大不相同。我很遗憾不得不从博客中引用这么多，但它有很好的解释。首先，如果 CAS 插入操作失败并出现此异常：

If the paxos phase fails, the driver will throw a WriteTimeoutException with a WriteType.CAS as retrieved with WriteTimeoutException#getWriteType(). In this situation you can’t know if the CAS operation has been applied so you need to retry it in order to fallback on a stable state. Because lightweight transactions are much more expensive that regular updates, the driver doesn’t automatically retry it for you. The paxos phase can also lead to an UnavailableException if not enough replicas are available. In this situation, retries won’t help as only SERIAL and LOCAL_SERIAL consistencies are available.

这可能是您遇到的最复杂的故障。 Since "you can't know if the CAS operation has been applied" then retrying IF NOT EXISTS 在某些情况下是不明确的。如果你再试一次，成功了，那是最好的情况；重试插入的值仍然是唯一的，您可以继续进行批处理。如果IF NOT EXISTS失败那么有两种可能：

最初失败的 CAS 操作部分成功写入一个或多个副本。
该值实际上不是唯一的，原始操作应该返回 [applied]=false]。

我认为如果不对 其他一些 权威状态执行读取操作，您就无法区分这些情况。如果您采纳我的建议，将 "third" insert-only table 用于 CAS，则查询 other table(s) 以查看如果该名称存在现有数据。

或者，如果 CAS 插入在提交阶段失败：

The commit phase is then similar to regular Cassandra writes in the sense that it will throw an UnavailableException or a WriteTimeoutException if the amount of required replicas or acknowledges isn’t met. In this situation rather than retrying the entire CAS operation, you can simply ignore this error if you make sure to use setConsistencyLevel(ConsistencyLevel.SERIAL) on the subsequent read statements on the column that was touched by this transaction, as it will force Cassandra to commit any remaining uncommitted Paxos state before proceeding with the read. That being said, it probably won’t be easy to organize an application to use SERIAL reads after a CAS write failure, so you may prefer another alternative such as an entire retry of the CAS operation.

以上信息似乎也适用于批处理案例中此异常的失败，因为更接近 "regular" 写入而不是 Paxos 事务，具有以下附加信息：

If a timeout occurs when a batch is executed, the developer has different options depending on the type of write that timed out (see WriteTimeoutException#getWriteType()):

BATCH_LOG: a timeout occurred while the coordinator was waiting for the batch log replicas to acknowledge the log. Thus the batch may or may not be applied. By default, the driver will retry the batch query once, when it’s notified that such a timeout occurred. So if you receive this error, you may want to retry again, but that’s already a bad smell that the coordinator has been unlucky at picking replicas twice.
BATCH: a timeout occurred while reaching replicas for one of the changes in an atomic batch, after an entry has been successfully written to the batch log. Cassandra will thus ensure that this batch will get eventually written to the appropriate replicas and the developer doesn’t have to do anything. Note however that this error still means that all the columns haven’t all been updated yet. So if the immediate consistency of these writes is required in the business logic to be executed, you probably want to consider an alternate end, or a warning message to the end user.
UNLOGGED_BATCH: the coordinator met a timeout while reaching the replicas for a write query being part of an unlogged batch. This batch isn’t guaranteed to be atomic as no batch log entry is written, thus the parts of the batch that will or will not be applied are unknown. A retry of the entire batch will be required to fall back on a known state.

或者，使应用程序足够健壮以处理不一致的状态。

在我自己的应用程序中，我没有理会第三个 table。我的过程是：

插入由 system-generated 唯一值键入的记录，无需担心唯一性。
插入指向前一个 ID 的记录，给定 user-generated 值作为键，条件是 IF NOT EXISTS。这应该是最后执行的步骤。
如果出现任何故障，请尝试通过删除数据来撤消之前的操作。在这种情况下，我不会将系统生成的 ID 公开给用户，除了 compete 成功，并且（在我的应用程序中）没有扫描 system-generated ID，所以我的 worst-case 场景是 "dangling" 数据浪费 space.
执行定期的后台清理任务以查找并修复不一致。

Answer 2

tl;

的博士

只有无条件插入有疑问，所以调换操作顺序，并假定任何异常都会失败。只要你从不给出一个可能失败的 system-generated id，它就永远不会被使用，所以即使相应的 user-generated 值不唯一也没有关系。

插入由 system-generated 唯一值键入的记录，无需担心唯一性。
出现任何错误，删除一致性级别为ANY的记录，并使整个操作失败。
插入指向前一个 ID 的记录，给定 user-generated 值作为键，以 IF NOT EXISTS 为条件。
如前所述，判断是否真的成功，如果不成功，则执行与之前第2步insert相同的回滚
执行定期的后台清理任务以查找并修复不一致。

所以根本不需要担心解释不明确的异常。

如何在 Cassandra 中实现跨多个表的插入，唯一且原子地？

How to achieve an insert across multiple tables in Cassandra, uniquely and atomically?

transactions

cassandra