What is the fastest way to insert rows into a PostgreSQL Database with GeoKettle?

Suppose I have a .csv file containing 100 million rows. I import that csv file into Pentaho Kettle and want to write all of the rows into a PostgreSQL database. What is the fastest insert transformation? I have already tried the normal Table Output step and the PostgreSQL Bulk Loader (which is much faster than Table Output), but it is still too slow. Is there a faster approach than the PostgreSQL Bulk Loader?

Given that the PostgreSQL Bulk Loader runs COPY table_name FROM STDIN, there is nothing faster for loading data into Postgres. Multi-value inserts are slower, and plain single-row INSERTs are the slowest of all. So you cannot make the load path itself any faster.
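
For reference, a minimal sketch of a server-side COPY that is equivalent to what the Bulk Loader streams over STDIN (my_table and the file path are placeholders, not taken from the question):

-- server-side load, file must be readable by the postgres server process
COPY my_table FROM '/path/to/rows.csv' WITH (FORMAT csv, HEADER true);
-- or streamed from the client with psql, which is effectively what the Bulk Loader does:
-- \copy my_table FROM 'rows.csv' WITH (FORMAT csv, HEADER true)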

To speed up COPY you can:

set commit_delay to 100000;
set synchronous_commit to off;

and apply other server-side tricks, such as dropping indexes before the load (see the sketch below).
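
A sketch of the drop-index trick, assuming a hypothetical GiST index on a geometry column (all names here are illustrative):

-- drop the index before the load so COPY does not have to maintain it row by row
DROP INDEX IF EXISTS my_table_geom_idx;
-- ... run the COPY / PostgreSQL Bulk Loader step ...
-- rebuild the index once, after all rows are in
CREATE INDEX my_table_geom_idx ON my_table USING gist (geom);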

Notes:

there is a very old but still relevant depesz post on this topic

pgloader most probably won't work with Pentaho Kettle, but it is worth checking out

Update

https://www.postgresql.org/docs/current/static/runtime-config-wal.html

synchronous_commit (enum)

Specifies whether transaction commit will wait for WAL records to be written to disk before the command returns a “success” indication to the client. Valid values are on, remote_apply, remote_write, local, and off. The default, and safe, setting is on. When off, there can be a delay between when success is reported to the client and when the transaction is really guaranteed to be safe against a server crash. (The maximum delay is three times wal_writer_delay.) Unlike fsync, setting this parameter to off does not create any risk of database inconsistency: an operating system or database crash might result in some recent allegedly-committed transactions being lost, but the database state will be just the same as if those transactions had been aborted cleanly. So, turning synchronous_commit off can be a useful alternative when performance is more important than exact certainty about the durability of a transaction.

(emphasis mine)

Also note that I recommend using SET at the session level. If GeoKettle does not allow you to run a command on Postgres before the load starts, you can use the pgbouncer connect_query option for the specific user/database pair, or look into some other trick. If you cannot set synchronous_commit per session and you decide to change it per database or per user (so that it applies to the GeoKettle connection), do not forget to set it back to on once the load has finished.
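
If you do end up changing it per database rather than per session, a sketch of both directions (my_db is a placeholder database name):

-- relax durability for the duration of the load
ALTER DATABASE my_db SET synchronous_commit TO off;
-- ... run the GeoKettle load ...
-- restore the safe default afterwards
ALTER DATABASE my_db SET synchronous_commit TO on;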