AWS RDS with Postgres: Is the OOM killer configured?

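We are running a load test against an application that accesses a Postgres database.

During the test, we suddenly saw the error rate increase. After analysing the platform and application behaviour, we noticed the following.

In the Postgres logs, we see: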

2018-08-21 08:19:48 UTC::@:[XXXXX]:LOG: server process (PID XXXX) was terminated by signal 9: Killed

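After investigating and reading the documentation, one possibility appears to be that the Linux OOM killer ran and killed the process.

But since we are on RDS, we cannot access the system log (/var/log/messages) to confirm this.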

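Can somebody confirm whether the OOM killer really runs on AWS RDS for Postgres?

I did not find an answer to this elsewhere.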

AWS maintains a page of best practices for its RDS service: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_BestPractices.html

On memory allocation, the recommendation is:

An Amazon RDS performance best practice is to allocate enough RAM so that your working set resides almost completely in memory. To tell if your working set is almost all in memory, check the ReadIOPS metric (using Amazon CloudWatch) while the DB instance is under load. The value of ReadIOPS should be small and stable. If scaling up the DB instance class—to a class with more RAM—results in a dramatic drop in ReadIOPS, your working set was not almost completely in memory. Continue to scale up until ReadIOPS no longer drops dramatically after a scaling operation, or ReadIOPS is reduced to a very small amount. For information on monitoring a DB instance's metrics, see Viewing DB Instance Metrics.

In addition, this is their recommendation for troubleshooting possible OS issues:

Amazon RDS provides metrics in real time for the operating system (OS) that your DB instance runs on. You can view the metrics for your DB instance using the console, or consume the Enhanced Monitoring JSON output from Amazon CloudWatch Logs in a monitoring system of your choice. For more information about Enhanced Monitoring, see Enhanced Monitoring

There is a lot of good advice there, including query tuning.

Note that, as a last resort, you could switch to Aurora, which is compatible with PostgreSQL:

Aurora features a distributed, fault-tolerant, self-healing storage system that auto-scales up to 64TB per database instance. Aurora delivers high performance and availability with up to 15 low-latency read replicas, point-in-time recovery, continuous backup to Amazon S3, and replication across three Availability Zones.

Edit: regarding your PostgreSQL issue specifically, check this Stack Exchange thread; they had a long-lived connection with auto commit set to false.

We had a long connection with auto commit set to false:

connection.setAutoCommit(false)

During that time we were doing a lot of small queries and a few queries with a cursor:

statement.setFetchSize(SOME_FETCH_SIZE)

In JDBC you create a connection object, and from that connection you create statements. When you execute the statements you get a result set.

Now, every one of these objects needs to be closed, but if you close the statement, the result set is closed, and if you close the connection all the statements are closed and their result sets.

We were used to short living queries with connections of their own so we never closed statements assuming the connection will handle the things once it is closed.

The problem was now with this long transaction (~24 hours) which never closed the connection. The statements were never closed. Apparently, the statement object holds resources both on the server that runs the code and on the PostgreSQL database.

My best guess to what resources are left in the DB is the things related to the cursor. The statements that used the cursor were never closed, so the result set they returned never closed as well. This meant the database didn't free the relevant cursor resources in the DB, and since it was over a huge table it took a lot of RAM.
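The leak described in the quote can be avoided by closing each statement and result set as soon as it is no longer needed, even when the connection itself stays open. Below is a minimal sketch using try-with-resources. The class name CursorScan and the method countRows are hypothetical and introduced only for illustration. It assumes the caller has already called connection.setAutoCommit(false), as in the quoted code.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical helper, not part of any library.
final class CursorScan {
    private CursorScan() {}

    /**
     * Iterates over the rows of a query in batches of fetchSize and returns
     * the row count. The result set and the statement are closed when their
     * try blocks exit, even if an exception is thrown, so nothing tied to
     * this query lingers on the long-lived connection.
     */
    static long countRows(Connection connection, String sql, int fetchSize)
            throws SQLException {
        long rows = 0;
        try (PreparedStatement statement = connection.prepareStatement(sql)) {
            // Ask the driver to fetch rows in batches instead of all at once.
            statement.setFetchSize(fetchSize);
            try (ResultSet resultSet = statement.executeQuery()) {
                while (resultSet.next()) {
                    rows++;
                }
            } // resultSet is closed here
        } // statement is closed here
        return rows;
    }
}
```

The connection is deliberately left open, so this pattern fits the long-running transaction from the quote: only the per-query objects are released.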

Hope this helps!

TL;DR: If you need PostgreSQL on AWS and you need rock-solid stability, run PostgreSQL on EC2 (for now) and do some kernel tuning for overcommitting.


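I'll try to be concise, but you are not the only one seeing this: it is a known (internal to Amazon) issue with RDS and Aurora PostgreSQL.

OOM Killer on RDS/Aurora

The OOM killer does run on RDS and Aurora instances, because they are backed by Linux VMs and the OOM killer is an integral part of the kernel.

Root Cause

The root cause is that the default Linux kernel configuration assumes that you have virtual memory (a swap file or partition), but EC2 instances (and the VMs that back RDS and Aurora) do not have virtual memory by default. There is a single partition and no swap file is defined. When Linux thinks it has virtual memory, it uses a strategy called "overcommitting", which means that it allows processes to request and be granted more memory than the system actually has. Two tunable parameters govern this behaviour: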

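vm.overcommit_memory - controls whether the kernel allows overcommitting (0 = yes = default)

vm.overcommit_ratio - the percentage of physical RAM that, added to swap, the kernel is willing to commit when vm.overcommit_memory = 2. If you have 8GB of RAM and 8GB of swap, and your vm.overcommit_ratio = 75, the kernel will grant up to 14GB of memory to processes (8GB of swap plus 75% of the 8GB of RAM).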
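As a rough illustration of the arithmetic, the kernel's overcommit accounting documentation describes the strict-mode (vm.overcommit_memory = 2) limit as swap plus vm.overcommit_ratio percent of physical RAM. The sketch below applies that formula; the class and method names are made up for this example, and huge pages (which the kernel subtracts from RAM first) are ignored as a simplifying assumption.

```java
// Hypothetical helper for illustration only.
final class OvercommitMath {
    private OvercommitMath() {}

    /**
     * Approximate commit limit in MiB when vm.overcommit_memory = 2:
     * swap + RAM * overcommit_ratio / 100 (huge pages ignored).
     */
    static long commitLimitMiB(long ramMiB, long swapMiB, int overcommitRatioPercent) {
        return swapMiB + ramMiB * overcommitRatioPercent / 100;
    }
}
```

For example, OvercommitMath.commitLimitMiB(8192, 0, 75) models an 8GB instance with no swap, which is the usual EC2 situation, and yields 6144 MiB, that is, 75% of RAM.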

We set up an EC2 instance (where we could tune these parameters), and the following settings completely stopped PostgreSQL backends from being killed:

vm.overcommit_memory = 2
vm.overcommit_ratio = 75

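vm.overcommit_memory = 2 tells Linux not to overcommit (work within the constraints of system memory), and vm.overcommit_ratio = 75 tells Linux not to grant requests for more than 75% of memory (only allow user processes to get up to 75% of memory).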
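On an EC2 instance where you control the kernel, one way to see how close the system is to that limit is to compare the CommitLimit and Committed_AS fields of /proc/meminfo. The following is a small sketch of a parser for that file's "Name: value kB" layout; the MemInfo class and its methods are hypothetical names introduced here, and this is not possible on RDS itself, where the file system is not accessible.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only.
final class MemInfo {
    private MemInfo() {}

    /** Parses /proc/meminfo text into a map of field name to numeric value. */
    static Map<String, Long> parse(String meminfoText) {
        Map<String, Long> fields = new HashMap<>();
        for (String line : meminfoText.split("\n")) {
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // not a "Name: value" line
            }
            String name = line.substring(0, colon).trim();
            String[] parts = line.substring(colon + 1).trim().split("\\s+");
            try {
                fields.put(name, Long.parseLong(parts[0]));
            } catch (NumberFormatException e) {
                // Skip lines whose value is not a plain number.
            }
        }
        return fields;
    }

    /** CommitLimit minus Committed_AS, in kB; a small or negative value means little headroom. */
    static long commitHeadroomKb(Map<String, Long> fields) {
        return fields.get("CommitLimit") - fields.get("Committed_AS");
    }
}
```

In practice the input would come from reading /proc/meminfo, for instance with new String(java.nio.file.Files.readAllBytes(java.nio.file.Paths.get("/proc/meminfo"))).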

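We have an open case with AWS, and they have promised to come up with a long-term fix (using kernel tuning parameters, cgroups, etc.), but we do not have an ETA yet. If you are having this problem, I encourage you to open a case with AWS and reference case #5881116231 so that they know you are also affected by this issue.

In short, if you need stability in the near term, use PostgreSQL on EC2. If you must use RDS or Aurora PostgreSQL, you will need to oversize your instance (at additional cost) and hope for the best, as oversizing does not guarantee that you won't still hit the problem.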