Unexpected failures using Serializable transaction level in Postgres
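We are developing a lightweight CRUD application and have chosen the Serializable isolation level for our transactions.
However, as we increased the load on our environment, we saw a large number of failures in transactions that we did not expect to cause any problems. In particular, we have one transaction that we have managed to strip down to the following, and it still exhibits the problem: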
transaction(Connection.TRANSACTION_SERIALIZABLE, 3) {
    val record = MyRecord(UUID.randomUUID(), UUID.randomUUID(), DEFAULT_JSON)
    myDao().getRecord(record.id)
    myDao().addRecord(record)
}
This translates to the following SQL:
SELECT mytable.id, mytable.userId, mytable.json, mytable.deleted_at
FROM mytable
WHERE mytable.id = '93ea4a65-cd52-4d73-ae74-38055c1b066b'
INSERT INTO mytable (deleted_at, json, id, user_id)
VALUES (NULL, '{"version":7}', '93ea4a65-cd52-4d73-ae74-38055c1b066b', '026d3c48-cdc5-4748-927b-408712e00f89')
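That is, a simple retrieve-then-insert by the PRIMARY KEY UUID column. When we scale this up (e.g. 40 threads, each running 50 of these transactions back to back), the vast majority of them fail with the following exception: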
o.p.u.PSQLException: ERROR: could not serialize access due to read/write dependencies among transactions
Detail: Reason code: Canceled on identification as a pivot, during write.
Hint: The transaction might succeed if retried.
at o.p.c.v.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2433)
at o.p.c.v.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2178)
at o.p.c.v.QueryExecutorImpl.execute(QueryExecutorImpl.java:306)
at o.p.jdbc.PgStatement.executeInternal(PgStatement.java:441)
at o.p.jdbc.PgStatement.execute(PgStatement.java:365)
at o.p.j.PgPreparedStatement.executeWithFlags(PgPreparedStatement.java:155)
at o.p.j.PgPreparedStatement.executeUpdate(PgPreparedStatement.java:132)
at o.j.e.s.s.InsertStatement.execInsertFunction(InsertStatement.kt:86)
at o.j.e.s.s.InsertStatement.executeInternal(InsertStatement.kt:95)
at o.j.e.s.s.InsertStatement.executeInternal(InsertStatement.kt:12)
at o.j.e.s.s.Statement.executeIn$exposed(Statement.kt:59)
... 90 common frames omitted
Wrapped by: o.j.e.e.ExposedSQLException: org.postgresql.util.PSQLException: ERROR: could not serialize access due to read/write dependencies among transactions
Detail: Reason code: Canceled on identification as a pivot, during write.
Hint: The transaction might succeed if retried.
at o.j.e.s.s.Statement.executeIn$exposed(Statement.kt:61)
at o.j.e.s.Transaction.exec(Transaction.kt:129)
at o.j.e.s.Transaction.exec(Transaction.kt:123)
at o.j.e.s.s.Statement.execute(Statement.kt:29)
at o.j.e.sql.QueriesKt.insert(Queries.kt:44)
at g.c.e.d.MyDao.insertEvent(DefaultEventsDao.kt:40)
... 81 common frames omitted
Digging into pg_locks while the threads are running, we can see:
| locktype | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid | mode | granted | fastpath
| page | 18496 | 17542 | 2 | <null> | <null> | <null> | <null> | <null> | <null> | 30/75 | 1467 | SIReadLock | True | False
| page | 18496 | 17542 | 5 | <null> | <null> | <null> | <null> | <null> | <null> | 34/45 | 1471 | SIReadLock | True | False
| page | 18496 | 17542 | 2 | <null> | <null> | <null> | <null> | <null> | <null> | 8/335 | 1446 | SIReadLock | True | False
| page | 18496 | 17542 | 1 | <null> | <null> | <null> | <null> | <null> | <null> | 31/65 | 1468 | SIReadLock | True | False
| page | 18496 | 17542 | 6 | <null> | <null> | <null> | <null> | <null> | <null> | 43/15 | 1480 | SIReadLock | True | False
| page | 18496 | 17542 | 4 | <null> | <null> | <null> | <null> | <null> | <null> | 5/357 | 1482 | SIReadLock | True | False
| page | 18496 | 17542 | 6 | <null> | <null> | <null> | <null> | <null> | <null> | 41/15 | 1478 | SIReadLock | True | False
| page | 18496 | 17542 | 6 | <null> | <null> | <null> | <null> | <null> | <null> | 40/30 | 1477 | SIReadLock | True | False
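Relation 17542 corresponds to the pkey of our table (verified by querying pg_class). So it appears that the transactions take a page lock for the SELECT, and then fail because other inserts into the same page happen at the same time.
This theory is supported by the fact that the failure rate of our test repro decreases as the table grows (the records are spread across more pages, so fewer collisions occur).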
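My questions, then:
- Is this expected behaviour in Postgres? Shouldn't a simple pkey lookup need at most a tuple lock, rather than a page lock?
- Is this something we should expect to be able to do at the Serializable isolation level? Dropping down to repeatable read eliminates the problem, but we are reluctant to do that without understanding more.
- Is there anything we should be doing to help Postgres in this case, e.g. a query annotation or a setting we can enable?
We are using Exposed in a Ktor backend written in Kotlin, in case that is relevant. Our Postgres version is 9.6.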
The PostgreSQL documentation says:
For optimal performance when relying on Serializable transactions for concurrency control, these issues should be considered:
[...]
- When the system is forced to combine multiple page-level predicate locks into a single relation-level predicate lock because the predicate lock table is short of memory, an increase in the rate of serialization failures may occur. You can avoid this by increasing max_pred_locks_per_transaction, max_pred_locks_per_relation, and/or max_pred_locks_per_page.
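For orientation, here is a minimal sketch of how the three settings named in the quote might look in postgresql.conf. The raised values are illustrative assumptions, not tuned recommendations. Note also that max_pred_locks_per_relation and max_pred_locks_per_page were introduced in PostgreSQL 10, so on a 9.6 server only max_pred_locks_per_transaction is available.

```
# postgresql.conf (illustrative values only; measure before and after changing them)
max_pred_locks_per_transaction = 128   # default 64; changing it requires a server restart
max_pred_locks_per_relation = -2       # default -2; PostgreSQL 10 and later
max_pred_locks_per_page = 8            # default 2; PostgreSQL 10 and later
```

The current values can be checked from any session with SHOW, for example SHOW max_pred_locks_per_transaction.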
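In your test case, there are three or more predicate locks on a single table page, so the locks are escalated to a page lock. That is why the transactions conflict with each other (they touch the same page).
Try increasing max_pred_locks_per_page.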
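Whatever the lock settings, the error itself carries the hint "The transaction might succeed if retried", and applications that use Serializable are generally expected to retry on serialization failures (SQLSTATE 40001). The second argument to transaction(...) in the question looks like a retry count, so Exposed may already be doing some of this. The sketch below is a library-independent illustration only: retryOnSerializationFailure and isSerializationFailure are hypothetical helper names, and walking the cause chain is an assumption based on the "Wrapped by" line in the stack trace.

```kotlin
import java.sql.SQLException

// SQLSTATE that PostgreSQL reports for serialization failures.
const val SERIALIZATION_FAILURE = "40001"

// Hypothetical helper: walks the cause chain, because a wrapper exception
// may nest the driver's SQLException that carries the SQLSTATE.
fun isSerializationFailure(e: Throwable): Boolean =
    generateSequence(e) { it.cause }
        .filterIsInstance<SQLException>()
        .any { it.sqlState == SERIALIZATION_FAILURE }

// Hypothetical helper: runs the block up to maxAttempts times, retrying only
// when the failure is a serialization failure; anything else is rethrown.
fun <T> retryOnSerializationFailure(maxAttempts: Int, block: () -> T): T {
    require(maxAttempts >= 1) { "maxAttempts must be at least 1" }
    var attempt = 1
    while (true) {
        try {
            return block()
        } catch (e: Exception) {
            if (attempt >= maxAttempts || !isSerializationFailure(e)) throw e
            attempt++
        }
    }
}
```

The block passed in should open a fresh transaction on every attempt, since the failed one has already been rolled back. Under heavy contention a short randomized delay between attempts may also help, though that is left out here to keep the sketch small.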