What is the fastest way to insert rows into a PostgreSQL Database with GeoKettle?
Suppose I have a .csv file containing 100 million rows. I import that csv file into Pentaho Kettle and want to write all rows into a PostgreSQL database. Which transformation gives the fastest inserts? I have already tried the normal Table Output step and the PostgreSQL Bulk Loader (much faster than Table Output), but it is still too slow. Is there a faster way than using the PostgreSQL Bulk Loader?
Considering the fact that the PostgreSQL Bulk Loader runs COPY table_name FROM STDIN, there is nothing faster for loading data into Postgres. Multi-value inserts will be slower, and single-row inserts slowest of all. So you can't make the load method itself any faster.
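For reference, this is roughly what the Bulk Loader does under the hood; a minimal sketch you could reproduce by hand in psql, where the table name, column list, options and file path are placeholders, not anything GeoKettle actually generates:

-- server-side COPY, reading csv from the client's stdin
copy my_table (id, name, geom) from stdin with (format csv, header true);
-- or the psql client-side variant, which needs no file access on the server:
\copy my_table (id, name, geom) from '/path/to/rows.csv' with (format csv, header true)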
To speed up COPY, you can:
set commit_delay to 100000;     -- group commits together; the value is in microseconds
set synchronous_commit to off;  -- don't wait for the WAL flush before reporting success
and apply other server-side tricks (such as dropping indexes before the load and recreating them afterwards).
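A hedged sketch of that index trick; the table and index names here are invented for illustration:

-- drop a hypothetical spatial index before the bulk load
drop index if exists my_table_geom_idx;
-- ... run the COPY / Bulk Loader step ...
-- then recreate it once the load has finished
create index my_table_geom_idx on my_table using gist (geom);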
Note:
a very old but still relevant depesz post
it most probably won't work with Pentaho Kettle, but pgloader is worth checking out (see the sketch below)
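If you do try pgloader, a minimal command file could look roughly like this; the file path, connection string, table and column names are all assumptions for the sake of the example:

LOAD CSV
     FROM '/path/to/rows.csv' (id, name, geom)
     INTO postgresql://user@localhost/mydb?my_table (id, name, geom)
     WITH skip header = 1,
          fields terminated by ','
;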
Update
https://www.postgresql.org/docs/current/static/runtime-config-wal.html
synchronous_commit (enum)
Specifies whether transaction commit will wait for WAL records to be
written to disk before the command returns a “success” indication to
the client. Valid values are on, remote_apply, remote_write, local,
and off. The default, and safe, setting is on. When off, there can be
a delay between when success is reported to the client and when the
transaction is really guaranteed to be safe against a server crash.
(The maximum delay is three times wal_writer_delay.) Unlike fsync,
setting this parameter to off does not create any risk of database
inconsistency: an operating system or database crash might result in
some recent allegedly-committed transactions being lost, but the
database state will be just the same as if those transactions had been
aborted cleanly. So, turning synchronous_commit off can be a useful
alternative when performance is more important than exact certainty
about the durability of a transaction.
(emphasis mine)
Also note that I recommend using SET at the session level, so if GeoKettle does not let you run a setup command against Postgres before the load, you can use pgbouncer's connect_query for the specific user/database pair, or consider some other trick. If you cannot set synchronous_commit per session and you decide to change it per database or per user (so that it applies to the GeoKettle connection), don't forget to set it back to on after the load is finished.
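A hedged sketch of those variants; mydb and the connection details are placeholders:

-- session level (preferred): affects only the current connection
set synchronous_commit to off;
-- per-database fallback: applies to new connections, so revert it after the load
alter database mydb set synchronous_commit to off;
alter database mydb reset synchronous_commit;

And the pgbouncer route, as an entry in the [databases] section of pgbouncer.ini:

; run the SET on every fresh server connection for this pool
mydb = host=127.0.0.1 port=5432 dbname=mydb connect_query='SET synchronous_commit TO off;'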