Postgres not very fast at finding unique values in table with about 1.3 billion rows

Question

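So I have a (logged) table with two columns, A and B, both containing text. They basically contain the same type of information; there are two columns only because of where the data came from.

I want a table of all the unique values (so I made the column the primary key), regardless of which column a value came from. But when I ask postgres to do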

insert into new_table(value) select A from old_table on conflict (value) do nothing;

(and later the same thing for column B)

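it used 1 cpu core and only read from my SSD at about 5 MB/s. I stopped it after a couple of hours.

I suspected that this might be because the b-tree is slow, so I added a hash index on the only attribute in my new table. But it still uses 1 core at most and reads from the SSD at only 5 MB/s. My java program can hashset that at at least 150 MB/s, so postgres should be way faster than 5 MB/s, right? I have analyzed my old table and I made my new table unlogged for faster inserts, and yet it still uses 1 core and reads extremely slowly.

How can I fix this?

EDIT: This is the explain output for the above query. Seems like postgres is using the b-tree it created for the primary key instead of my (much faster, isn't it??) Hash index.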

Insert on users  (cost=0.00..28648717.24 rows=1340108416 width=14)
  Conflict Resolution: NOTHING
  Conflict Arbiter Indexes: users_pkey
  ->  Seq Scan on games  (cost=0.00..28648717.24 rows=1340108416 width=14)

Answer 1

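The ON CONFLICT mechanism is primarily meant for resolving conflicts caused by concurrency. You can use it in a "static" case like this one, but other methods will be more efficient.

Start by inserting only the distinct values: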

insert into new_table(value) 
    select A from old_table union
    select B from old_table

For better performance, don't add the primary key until after the table is populated. And set work_mem to the largest value you credibly can.
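The whole sequence might look roughly like the sketch below. The table and column names come from the question; the work_mem value is only an assumed placeholder, and the sketch assumes neither column contains NULLs (a NULL would make the final ADD PRIMARY KEY fail).

```sql
-- Sketch only: '1GB' is an assumed placeholder, not a recommendation.
-- SET changes work_mem for the current session only.
SET work_mem = '1GB';

-- Create the target without a primary key, so no index has to be
-- maintained while the rows are loaded.
CREATE UNLOGGED TABLE new_table (value text);

-- UNION (without ALL) removes duplicates within and across both columns.
-- Assumes A and B contain no NULLs; otherwise add WHERE ... IS NOT NULL.
INSERT INTO new_table (value)
    SELECT A FROM old_table
    UNION
    SELECT B FROM old_table;

-- Build the unique btree once, after the data is in place.
ALTER TABLE new_table ADD PRIMARY KEY (value);
```

Building the index once at the end avoids a separate index lookup and insertion for each of the 1.3 billion source rows.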

My java program can hashset that at at least 150 MB/s,

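That is using a hash set that lives entirely in memory. PostgreSQL indexes are disk-based structures. They do benefit from caching, but that only goes so far, and it depends on hardware and settings you haven't told us about.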
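As a minimal sketch, a few of the memory-related settings in question can be read from a psql session like this. Which settings matter most for this workload is an assumption here, not something the thread establishes.

```sql
-- Report the current values of some memory-related settings.
SHOW shared_buffers;        -- size of PostgreSQL's shared page cache
SHOW work_mem;              -- memory per sort or hash operation
SHOW maintenance_work_mem;  -- memory for maintenance work such as index builds
```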

Seems like postgres is using the b-tree it created for the primary key instead of my (much faster, isn't it??) Hash index.

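It can only use the index that defines the constraint, which is the btree index, because hash indexes cannot support primary key constraints. You could define an EXCLUDE constraint using a hash index, but that would just make it slower. And in general, hash indexes are not "much faster" than btree indexes in PostgreSQL.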
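For illustration only, a hash-backed EXCLUDE constraint could be declared roughly as follows. The constraint name is hypothetical, and the sketch assumes new_table has no primary key; it is not expected to be faster than the btree.

```sql
-- Hypothetical constraint name; new_table(value) follows the question.
-- Rejects any two rows whose values compare equal, enforced by a hash index.
ALTER TABLE new_table
    ADD CONSTRAINT new_table_value_excl
    EXCLUDE USING hash (value WITH =);

-- With no conflict target, ON CONFLICT DO NOTHING also skips rows that
-- would violate an exclusion constraint.
INSERT INTO new_table (value)
    SELECT A FROM old_table
    ON CONFLICT DO NOTHING;
```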

postgresql

indexing

performance

unique