Unexpected failures using Serializable transaction level in Postgres

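We are developing a lightweight CRUD application and have chosen the Serializable isolation level for our transactions.

However, as we increase the load on our environment, we see a large number of failures in transactions that we would not expect to cause any problems. In particular, we have one transaction that we have managed to strip down to the following, and it still fails: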

transaction(Connection.TRANSACTION_SERIALIZABLE, 3) {
    val record = MyRecord(UUID.randomUUID(), UUID.randomUUID(), DEFAULT_JSON)
    myDao().getRecord(record.id)
    myDao().addRecord(record)
}

This translates to the following SQL:

SELECT mytable.id, mytable.user_id, mytable.json, mytable.deleted_at
FROM mytable 
WHERE mytable.id = '93ea4a65-cd52-4d73-ae74-38055c1b066b'

INSERT INTO mytable (deleted_at, json, id, user_id) 
VALUES (NULL, '{"version":7}', '93ea4a65-cd52-4d73-ae74-38055c1b066b', '026d3c48-cdc5-4748-927b-408712e00f89')

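That is, a simple retrieve-then-insert by a PRIMARY KEY UUID column. When we ramp this up (e.g. 40 threads, each running 50 transactions in a row), we see the vast majority of them fail with the following exception: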

o.p.u.PSQLException: ERROR: could not serialize access due to read/write dependencies among transactions

Detail: Reason code: Canceled on identification as a pivot, during write.
Hint: The transaction might succeed if retried.

at o.p.c.v.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2433)
at o.p.c.v.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2178)
at o.p.c.v.QueryExecutorImpl.execute(QueryExecutorImpl.java:306)
at o.p.jdbc.PgStatement.executeInternal(PgStatement.java:441)
at o.p.jdbc.PgStatement.execute(PgStatement.java:365)
at o.p.j.PgPreparedStatement.executeWithFlags(PgPreparedStatement.java:155)
at o.p.j.PgPreparedStatement.executeUpdate(PgPreparedStatement.java:132)
at o.j.e.s.s.InsertStatement.execInsertFunction(InsertStatement.kt:86)
at o.j.e.s.s.InsertStatement.executeInternal(InsertStatement.kt:95)
at o.j.e.s.s.InsertStatement.executeInternal(InsertStatement.kt:12)
at o.j.e.s.s.Statement.executeIn$exposed(Statement.kt:59)
... 90 common frames omitted
Wrapped by: o.j.e.e.ExposedSQLException: org.postgresql.util.PSQLException: ERROR: could not serialize access due to read/write dependencies among transactions
Detail: Reason code: Canceled on identification as a pivot, during write.
Hint: The transaction might succeed if retried.
at o.j.e.s.s.Statement.executeIn$exposed(Statement.kt:61)
at o.j.e.s.Transaction.exec(Transaction.kt:129)
at o.j.e.s.Transaction.exec(Transaction.kt:123)
at o.j.e.s.s.Statement.execute(Statement.kt:29)
at o.j.e.sql.QueriesKt.insert(Queries.kt:44)
at g.c.e.d.MyDao.insertEvent(DefaultEventsDao.kt:40)
... 81 common frames omitted

Digging into pg_locks while the threads are running, we can see:

| locktype      | database   | relation   | page   | tuple   | virtualxid   | transactionid   | classid   | objid   | objsubid   | virtualtransaction   | pid   | mode             | granted   | fastpath 

| page          | 18496      | 17542      | 2      | <null>  | <null>       | <null>          | <null>    | <null>  | <null>     | 30/75                | 1467  | SIReadLock       | True      | False       
| page          | 18496      | 17542      | 5      | <null>  | <null>       | <null>          | <null>    | <null>  | <null>     | 34/45                | 1471  | SIReadLock       | True      | False      
| page          | 18496      | 17542      | 2      | <null>  | <null>       | <null>          | <null>    | <null>  | <null>     | 8/335                | 1446  | SIReadLock       | True      | False      
| page          | 18496      | 17542      | 1      | <null>  | <null>       | <null>          | <null>    | <null>  | <null>     | 31/65                | 1468  | SIReadLock       | True      | False      
| page          | 18496      | 17542      | 6      | <null>  | <null>       | <null>          | <null>    | <null>  | <null>     | 43/15                | 1480  | SIReadLock       | True      | False      
| page          | 18496      | 17542      | 4      | <null>  | <null>       | <null>          | <null>    | <null>  | <null>     | 5/357                | 1482  | SIReadLock       | True      | False      
| page          | 18496      | 17542      | 6      | <null>  | <null>       | <null>          | <null>    | <null>  | <null>     | 41/15                | 1478  | SIReadLock       | True      | False       
| page          | 18496      | 17542      | 6      | <null>  | <null>       | <null>          | <null>    | <null>  | <null>     | 40/30                | 1477  | SIReadLock       | True      | False   

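Relation 17542 corresponds to the pkey of our table (verified by querying pg_class). So it appears that the transactions take page-level locks for the SELECT and then fail because other inserts into the same page happen concurrently.

This theory gains weight from the fact that the failure rate reproduced by our test drops as the table grows (the records are spread across more pages, so fewer collisions occur).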
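
To put rough numbers on that intuition, here is a back-of-the-envelope sketch. It assumes each in-flight transaction touches one index page chosen uniformly at random, which is a simplification and not a model of how Postgres actually detects serialization conflicts. The function name `sharedPageProbability` is made up for illustration.

```kotlin
// Probability that at least two of `concurrent` transactions touch the same
// page, ASSUMING each one picks one of `pages` pages uniformly at random.
// This is a birthday-problem estimate, not Postgres's real conflict logic.
fun sharedPageProbability(concurrent: Int, pages: Int): Double {
    require(concurrent >= 0 && pages >= 1)
    // More transactions than pages: some page must be shared (pigeonhole).
    if (concurrent > pages) return 1.0
    var allDistinct = 1.0
    for (i in 0 until concurrent) {
        // The (i+1)-th transaction must avoid the i pages already taken.
        allDistinct *= 1.0 - i.toDouble() / pages
    }
    return 1.0 - allDistinct
}
```

With 40 concurrent transactions and only a handful of pages, as in the pg_locks output above, sharing a page is unavoidable under this assumption. The estimate only becomes small once the number of pages is much larger than the number of concurrent transactions, which matches the trend we observe.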

My questions, then, are:

We are using Exposed in a Ktor backend written in Kotlin, in case that is relevant. Our Postgres version is 9.6.

This is working as expected:

For optimal performance when relying on Serializable transactions for concurrency control, these issues should be considered:

[...]

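In your test case, there are three or more predicate locks on one table page, so the locks are escalated to a page lock. That is why the transactions conflict with each other (they affect the same page).

Try increasing max_pred_locks_per_page. Note that this parameter was introduced in PostgreSQL 10, so it is not available on 9.6.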
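
Independently of any lock tuning, the error's own hint says the transaction might succeed if retried, and serialization failures are reported with SQLSTATE 40001. The second argument in your `transaction(...)` call looks like a retry count already, so treat the following only as a minimal sketch of what such a retry loop does. `retrySerializable` and `isSerializationFailure` are hypothetical helpers, not part of Exposed or the JDBC driver, and the back-off values are arbitrary.

```kotlin
import java.sql.SQLException

// SQLSTATE 40001 is PostgreSQL's code for serialization_failure.
const val SERIALIZATION_FAILURE = "40001"

// Walk the cause chain: a wrapping exception (like the "Wrapped by" frame in
// the stack trace) may not carry the SQLSTATE itself.
fun isSerializationFailure(e: Throwable): Boolean =
    generateSequence(e) { it.cause }
        .take(16) // guard against cyclic cause chains
        .any { it is SQLException && it.sqlState == SERIALIZATION_FAILURE }

// Hypothetical helper: runs `block` (one whole transaction) up to
// `maxAttempts` times, retrying only on serialization failures.
fun <T> retrySerializable(maxAttempts: Int, baseDelayMs: Long = 10, block: () -> T): T {
    require(maxAttempts >= 1)
    var attempt = 1
    while (true) {
        try {
            return block()
        } catch (e: Exception) {
            // Give up on the last attempt or on any unrelated error.
            if (attempt >= maxAttempts || !isSerializationFailure(e)) throw e
            // Growing pause plus jitter so retries do not collide in lockstep.
            val jitter = (0 until baseDelayMs.toInt().coerceAtLeast(1)).random()
            Thread.sleep(baseDelayMs * attempt + jitter)
            attempt++
        }
    }
}
```

The important design point is that the whole transaction body is re-executed from the start, including the SELECT, since a retried transaction must re-read its data under a fresh snapshot.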