刷新物化视图同时导致 table 膨胀

Refreshing materialized view CONCURRENTLY causes table bloat

在 PostgreSQL 9.5 中,我决定创建一个物化视图 "effects" 并安排每小时并发刷新,因为我希望它始终可用:

REFRESH MATERIALIZED VIEW CONCURRENTLY effects;

一开始一切都很好,我的物化视图正在刷新,磁盘 space 使用率或多或少保持不变。


问题

一段时间后,磁盘使用率开始线性增长。

我得出结论,这种增长的原因是物化视图,运行 来自 this 的查询回答得到以下结果:

               what                |  bytes/ct   | bytes_pretty | bytes_per_row
-----------------------------------+-------------+--------------+---------------
 core_relation_size                | 32224567296 | 30 GB        |         21140
 visibility_map                    |      991232 | 968 kB       |             0
 free_space_map                    |     7938048 | 7752 kB      |             5
 table_size_incl_toast             | 32233504768 | 30 GB        |         21146
 indexes_size                      | 22975922176 | 21 GB        |         15073
 total_size_incl_toast_and_indexes | 55209426944 | 51 GB        |         36220
 live_rows_in_text_representation  |   316152215 | 302 MB       |           207
 ------------------------------    |             |              |
 row_count                         |     1524278 |              |
 live_tuples                       |      676439 |              |
 dead_tuples                       |     1524208 |              |
(11 rows)

然后,我发现这个 table 最后一次自动清理是在两天前,由 运行:

SELECT relname, n_dead_tup, last_vacuum, last_autovacuum FROM pg_stat_user_tables ORDER BY n_dead_tup desc;

我决定手动调用vacuum (VERBOSE) effects。它 运行 大约半小时并产生以下输出:

vacuum (VERBOSE) effects;
INFO:  vacuuming "public.effects"
INFO:  scanned index "effects_idx" to remove 129523454 row versions
DETAIL:  CPU 12.16s/55.76u sec elapsed 119.87 sec

INFO:  scanned index "effects_campaign_created_idx" to remove 129523454 row versions
DETAIL:  CPU 19.11s/154.59u sec elapsed 337.91 sec

INFO:  scanned index "effects_campaign_name_idx" to remove 129523454 row versions
DETAIL:  CPU 28.51s/151.16u sec elapsed 315.51 sec

INFO:  scanned index "effects_campaign_event_type_idx" to remove 129523454 row versions
DETAIL:  CPU 38.60s/373.59u sec elapsed 601.73 sec

INFO:  "effects": removed 129523454 row versions in 3865537 pages
DETAIL:  CPU 59.02s/36.48u sec elapsed 326.43 sec

INFO:  index "effects_idx" now contains 1524208 row versions in 472258 pages
DETAIL:  113679000 index row versions were removed.
463896 index pages have been deleted, 60386 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.01 sec.

INFO:  index "effects_campaign_created_idx" now contains 1524208 row versions in 664910 pages
DETAIL:  121637488 index row versions were removed.
41014 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.

INFO:  index "effects_campaign_name_idx" now contains 1524208 row versions in 711391 pages
DETAIL:  125650677 index row versions were removed.
696221 index pages have been deleted, 28150 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.

INFO:  index "effects_campaign_event_type_idx" now contains 1524208 row versions in 956018 pages
DETAIL:  127659042 index row versions were removed.
934288 index pages have been deleted, 32105 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.

INFO:  "effects": found 0 removable, 493 nonremovable row versions in 3880239 out of 3933663 pages
DETAIL:  0 dead row versions cannot be removed yet.

There were 666922 unused item pointers.
Skipped 0 pages due to buffer pins.
0 pages are entirely empty.
CPU 180.49s/788.60u sec elapsed 1799.42 sec.

INFO:  vacuuming "pg_toast.pg_toast_1371723"
INFO:  index "pg_toast_1371723_index" now contains 0 row versions in 1 pages
DETAIL:  0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.

INFO:  "pg_toast_1371723": found 0 removable, 0 nonremovable row versions in 0 out of 0 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.

VACUUM

此时我认为问题已解决,并开始思考什么会干扰 autovacuum。可以肯定的是,我再次 运行 查询以查找 table 的 space 用法,令我惊讶的是它没有改变。

只有在我调用REFRESH MATERIALIZED VIEW effects; 后才。只是现在检查 table 大小的查询输出是:

               what                | bytes/ct  | bytes_pretty | bytes_per_row
-----------------------------------+-----------+--------------+---------------
 core_relation_size                | 374005760 | 357 MB       |           245
 visibility_map                    |         0 | 0 bytes      |             0
 free_space_map                    |         0 | 0 bytes      |             0
 table_size_incl_toast             | 374013952 | 357 MB       |           245
 indexes_size                      | 213843968 | 204 MB       |           140
 total_size_incl_toast_and_indexes | 587857920 | 561 MB       |           385
 live_rows_in_text_representation  | 316175512 | 302 MB       |           207
 ------------------------------    |           |              |
 row_count                         |   1524385 |              |
 live_tuples                       |    676439 |              |
 dead_tuples                       |   1524208 |              |
(11 rows)

然后一切恢复正常...


问题

问题已解决,但仍有相当多的困惑

  1. 谁能解释一下我遇到的问题是什么?
  2. 以后如何避免这种情况?

首先,让我们解释一下膨胀

REFRESH MATERIALIZED CONCURRENTLYsrc/backend/commands/matview.c中实现,注释有启发:

/*
 * refresh_by_match_merge
 *
 * Refresh a materialized view with transactional semantics, while allowing
 * concurrent reads.
 *
 * This is called after a new version of the data has been created in a
 * temporary table.  It performs a full outer join against the old version of
 * the data, producing "diff" results.  This join cannot work if there are any
 * duplicated rows in either the old or new versions, in the sense that every
 * column would compare as equal between the two rows.  It does work correctly
 * in the face of rows which have at least one NULL value, with all non-NULL
 * columns equal.  The behavior of NULLs on equality tests and on UNIQUE
 * indexes turns out to be quite convenient here; the tests we need to make
 * are consistent with default behavior.  If there is at least one UNIQUE
 * index on the materialized view, we have exactly the guarantee we need.
 *
 * The temporary table used to hold the diff results contains just the TID of
 * the old record (if matched) and the ROW from the new table as a single
 * column of complex record type (if matched).
 *
 * Once we have the diff table, we perform set-based DELETE and INSERT
 * operations against the materialized view, and discard both temporary
 * tables.
 *
 * Everything from the generation of the new data to applying the differences
 * takes place under cover of an ExclusiveLock, since it seems as though we
 * would want to prohibit not only concurrent REFRESH operations, but also
 * incremental maintenance.  It also doesn't seem reasonable or safe to allow
 * SELECT FOR UPDATE or SELECT FOR SHARE on rows being updated or deleted by
 * this command.
 */

因此通过删除行并从临时 table 插入新行来刷新物化视图。这当然会导致死元组和 table 膨胀,您的 VACUUM (VERBOSE) 输出证实了这一点。

在某种程度上,这就是您为 CONCURRENTLY 支付的价格。

其次,让我们揭穿 VACUUM 无法删除死元组的神话

VACUUM 将删除死行,但它不能减少膨胀(这可以用 VACUUM (FULL) 来完成,但是这会像 REFRESH MATERIALIZED VIEW 一样锁定视图而没有 CONCURRENTLY).

我怀疑你用来确定死元组数量的查询只是一个估计,它使死元组数量错误。

一个演示所有这些的例子

CREATE TABLE tab AS SELECT id, 'row ' || id AS val FROM generate_series(1, 100000) AS id;

-- make sure autovacuum doesn't spoil our demonstration
CREATE MATERIALIZED VIEW tab_v WITH (autovacuum_enabled = off)
AS SELECT * FROM tab;

-- required for CONCURRENTLY
CREATE UNIQUE INDEX ON tab_v (id);

使用 pgstattuple 扩展准确测量 table 膨胀:

CREATE EXTENSION pgstattuple;

SELECT * FROM pgstattuple('tab_v');
-[ RECORD 1 ]------+--------
table_len          | 4431872
tuple_count        | 100000
tuple_len          | 3788895
tuple_percent      | 85.49
dead_tuple_count   | 0
dead_tuple_len     | 0
dead_tuple_percent | 0
free_space         | 16724
free_percent       | 0.38

现在让我们删除table中的一些行,刷新并再次测量:

DELETE FROM tab WHERE id BETWEEN 40001 AND 80000;

REFRESH MATERIALIZED VIEW CONCURRENTLY tab_v;

SELECT * FROM pgstattuple('tab_v');
-[ RECORD 1 ]------+--------
table_len          | 4431872
tuple_count        | 60000
tuple_len          | 2268895
tuple_percent      | 51.19
dead_tuple_count   | 40000
dead_tuple_len     | 1520000
dead_tuple_percent | 34.3
free_space         | 16724
free_percent       | 0.38

大量死元组。 VACUUM 摆脱这些:

VACUUM tab_v;

SELECT * FROM pgstattuple('tab_v');
-[ RECORD 1 ]------+--------
table_len          | 4431872
tuple_count        | 60000
tuple_len          | 2268895
tuple_percent      | 51.19
dead_tuple_count   | 0
dead_tuple_len     | 0
dead_tuple_percent | 0
free_space         | 1616724
free_percent       | 36.48

死元组不见了,但现在有很多空的space。

我正在添加上面提供的@Laurenz Albe 完整答案。腹胀还有另一种可能性。考虑以下情况:

您的视图很少更改(1000000 条记录,每个请求更改 100 条记录),但您仍然有 500000 个死元组。索引列中的原因可以为空。

如上面的答案所述,当视图同时具体化时,将重新创建一个副本并进行比较。比较使用强制唯一索引,但是空值呢? sql 中的空值永远不会彼此相等。因此,如果您的主键允许空值,则即使未更改也将始终重新创建包含空值的记录并将其添加到 table

为了解决这个问题,您可以通过添加额外的列来消除膨胀,将空列合并为一些从未使用过的值 (-1, to_timestamp(0), ... ) 并仅将此列用于主索引