AWS ELB Intermittent 502 gateway timeout error

Question

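I have an ELB in front of a node.js application deployed on 3 EC2 instances.

I have started seeing intermittent HTTP 502 Bad Gateway errors.

Below is an excerpt from my access logs. The 502 errors follow no pattern, so I can't narrow down the cause.

Is this an ELB issue or an application issue?

Can the access logs help me figure this out?

This happens for about 5 out of every 100 requests.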

*type*                     https    
*timestamp*                2019-05-08T14:50:11.438405Z  
*elb*                      <my-elb>
*client:port*              clientIp:port
*target:port*              targetIp:port
*request_processing_time*  0    
*target_processing_time*   2.596    
*response_processing_time* -1   
*elb_status_code*          502  
*target_status_code*       -    
*received_bytes*           792  
*sent_bytes*               293  
*request*                  POST https://app/app-url/2.0/resourcepath/id/abc?queryParamA=abc&queryParamB=false&queryParamC=6b84c34 HTTP/1.1  
*user_agent*               Apache-CXF/3.2.5 
*ssl_cipher*               ssl-cipher
*ssl_protocol*             TLSv1.2  
*target_group_arn*         arn
*trace_id*                 traceId
*domain_name*              cool-domain-name
*chosen_cert_arn*          session-reused   
*matched_rule_priority*    0    
*request_creation_time*    2019-05-08T14:50:08.841000Z  
*actions_executed*         forward  
*redirect_url*             -    
*error_reason*             -

Answer 1

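A 502 at the ELB usually points to an application or server problem: the ELB ran into trouble while talking to the application server. Check the application logs for restarts or other errors.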

According to RFC 2616:

10.5.3 502 Bad Gateway
The server, while acting as a gateway or proxy, received an invalid response from the upstream server it accessed in attempting to fulfill the request.

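Possible causes are empty or incomplete headers or response body, resulting from a dropped connection. Also look for 500 server errors in the application logs.

In your case, an application server crash would cause the ELB to return 502 errors.

See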

https://en.wikipedia.org/wiki/List_of_HTTP_status_codes

Answer 2

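Here is a reference link as a starting point: https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-troubleshooting.html#http-502-issues

The most common cause listed there is a backend keep-alive timeout that is shorter than the ELB idle timeout. The ELB keeps the connection open while the backend closes it, and when the ELB then reuses that same TCP connection for a new request, the connection gets reset.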
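For the Node.js application in the question, the fix for this mismatch could look roughly like the sketch below. It assumes the load balancer idle timeout is 60 seconds (the ALB default; check your own setting), and the `createServer` wrapper and the 5 second margin are illustrative choices, not something from the AWS documentation. Setting `headersTimeout` above `keepAliveTimeout` is a commonly recommended precaution for some Node versions.

```javascript
const http = require('http');

// Assumption: the load balancer idle timeout is 60 s (the ALB default).
const ELB_IDLE_TIMEOUT_MS = 60 * 1000;

// Hypothetical helper: builds an HTTP server whose idle keep-alive
// connections outlive the load balancer's idle timeout.
function createServer(handler) {
  const server = http.createServer(handler);
  // Keep idle sockets open longer than the load balancer does, so the
  // load balancer is the side that closes an idle connection.
  server.keepAliveTimeout = ELB_IDLE_TIMEOUT_MS + 5000;
  // Keep the headers timeout above the keep-alive timeout as a precaution.
  server.headersTimeout = server.keepAliveTimeout + 1000;
  return server;
}

// Usage (port 3000 is a placeholder):
// createServer((req, res) => res.end('ok')).listen(3000);
```

Node's default `keepAliveTimeout` is 5 seconds, which is well below a 60 second load balancer idle timeout, so an unconfigured Node server is exposed to exactly this race.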

Answer 3

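Make sure your Node server's keepAliveTimeout is greater than the ELB idle timeout. ELBs and ALBs do not cope well with the target machine closing the connection. To detect this, check the ELB logs: you will probably see a response_processing_time of -1 together with a 502. The -1 means the load balancer never received a response from the target, for example because the target closed the connection.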
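A rough way to look for that signature in the logs is sketched below. It assumes each log entry has already been parsed into an object whose keys match the field names in the question's excerpt; the function names are made up for illustration, and parsing the raw log lines is left out. A match is consistent with the target closing the connection but does not prove it.

```javascript
// Returns true when an entry shows the pattern described above: the load
// balancer answered 502, the target sent no status code, and the response
// processing time is -1.
function looksLikeTargetClosedConnection(entry) {
  return (
    entry.elb_status_code === '502' &&
    entry.target_status_code === '-' &&
    Number(entry.response_processing_time) === -1
  );
}

// Counts how many of the parsed entries match the pattern.
function summarize(entries) {
  const suspicious = entries.filter(looksLikeTargetClosedConnection);
  return { total: entries.length, suspicious: suspicious.length };
}
```

The entry posted in the question (502, target status `-`, response_processing_time `-1`) matches this pattern.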

Answer 4

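Make sure you are not running Apache with the event MPM module (the default) behind the ALB/ELB. It closes connections dynamically. Try the worker MPM instead.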

Answer 5

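Check your target group and make sure the health check is working properly. In our case, we had a broken health check that marked every node as unhealthy. We use auto scaling, so nodes are added to and removed from the group dynamically, depending on the current load.

That is when this odd behavior of the AWS load balancer kicked in (from the official documentation, https://docs.aws.amazon.com/elasticloadbalancing/latest/application/elb-ag.pdf):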

If there is at least one healthy target in a target group, the load balancer routes requests only to the healthy targets. If a target group contains only unhealthy targets, the load balancer routes requests to the unhealthy targets.

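Because of this behavior, we did not notice that the health check was broken, and the 502s occurred on requests routed to nodes that were just being added to or removed from the target group. Fixing the health check solved the problem.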
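For a Node.js target like the one in the question, a health check endpoint could be as small as the sketch below. The `/health` path and the handler name are hypothetical; the path has to match whatever is configured as the health check path on the target group, and the 200 status has to fall within its configured success codes.

```javascript
// Hypothetical path; it must match the target group's health check path.
const HEALTH_PATH = '/health';

// Answers the load balancer's health check with 200 and everything else
// with 404. A real application would route other paths to its own handlers.
function handleRequest(req, res) {
  if (req.method === 'GET' && req.url === HEALTH_PATH) {
    res.writeHead(200, { 'Content-Type': 'text/plain' });
    res.end('ok');
    return;
  }
  res.writeHead(404, { 'Content-Type': 'text/plain' });
  res.end('not found');
}

// Usage (port 3000 is a placeholder):
// require('http').createServer(handleRequest).listen(3000);
```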

amazon-web-services

amazon-elb