EC2 Java StartInstancesRequest goes from "pending" to "stopping" to "stopped"
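I have the following situation:
- A dedicated-tenancy m4.large EC2 instance running RHEL6
- Starting it manually from the AWS console works fine
- A Lambda function (written in Java) that tries to start it fails, because the instance state goes: stopped -> pending -> stopping -> stopped
I have a Lambda function that logs all EC2 state changes in the VPC. It looks like this: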
'use strict';
exports.handler = (event, context, callback) => {
    console.log('LogEC2InstanceStateChange');
    console.log('Received event:', JSON.stringify(event, null, 2));
    callback(null, 'Finished');
};
A second Lambda function, written in Java, tries to start EC2 instances on a schedule. There is a lot of code, but the core of it is this:
public void handleRequest(Object input, Context context) {
    final List<String> instancesToStart = getInstancesToStart(); // implementation not shown
    try {
        // Pass the list directly: casting the Object[] from toArray() to String[] throws ClassCastException.
        StartInstancesRequest startRequest = new StartInstancesRequest().withInstanceIds(instancesToStart);
        context.getLogger().log("StartInstancesRequest: " + startRequest);
        StartInstancesResult res = ec2.startInstances(startRequest);
        context.getLogger().log("StartInstancesResult: " + res);
    }
    catch (Exception e) {
        logException(e); // logs the stack trace string through the Lambda logger
    }
}
The instancesToStart list is populated with instance IDs such as i-0abcdef1234567890.
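When a List<String> of instance IDs has to be handed to an API that takes a String array, note that List.toArray() with no arguments returns an Object[] (for ArrayList and most common implementations), so casting its result to String[] fails at runtime. Here is a minimal sketch of the difference, using only the standard library; the class and method names are illustrative and not part of the original code:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class InstanceIdArrays {

    // Safe conversion: passing a typed array makes toArray return a String[].
    static String[] toIdArray(List<String> ids) {
        return ids.toArray(new String[0]);
    }

    // Unsafe conversion: toArray() yields an Object[] here, so the cast
    // throws ClassCastException at runtime even though it compiles.
    static boolean uncheckedCastFails(List<String> ids) {
        try {
            String[] unused = (String[]) ids.toArray();
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        List<String> ids = new ArrayList<>();
        ids.add("i-0abcdef1234567890");
        System.out.println(Arrays.toString(toIdArray(ids)));
        System.out.println("cast fails: " + uncheckedCastFails(ids));
    }
}
```

If the target method also accepts a collection, passing the list itself avoids the conversion entirely.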
I use CloudFormation to create the Lambda functions and all the required IAM roles and so on. Here is the part that describes the role and permissions of the Java-based Lambda function:
Resources:
  EC2SchedulerRole:
    Type: 'AWS::IAM::Role'
    Properties:
      AssumeRolePolicyDocument:
        Version: 2012-10-17
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - lambda.amazonaws.com
            Action:
              - 'sts:AssumeRole'
      Path: /
  EC2SchedulerPolicy:
    DependsOn:
      - EC2SchedulerRole
    Type: 'AWS::IAM::Policy'
    Properties:
      PolicyName: ec2-scheduler-role
      Roles:
        - !Ref EC2SchedulerRole
      PolicyDocument:
        Version: 2012-10-17
        Statement:
          - Effect: Allow
            Action:
              - 'logs:*'
            Resource:
              - 'arn:aws:logs:*:*:*'
          - Effect: Allow
            Action:
              - 'ec2:DescribeInstanceAttribute'
              - 'ec2:DescribeInstanceStatus'
              - 'ec2:DescribeInstances'
              - 'ec2:StartInstances'
              - 'ec2:StopInstances'
              - 'ec2:DeleteTags'
            Resource:
              - '*'
What ends up happening, according to the CloudWatch logs of the first function (the one that logs instance state transitions), is this:
Received event:
{
"version": "0",
"id": "<guid>",
"detail-type": "EC2 Instance State-change Notification",
"source": "aws.ec2",
"account": "12345678",
"time": "2019-06-20T19:01:35Z",
"region": "us-east-1",
"resources": [
"arn:aws:ec2:us-east-1:12345678:instance/i-0abcdef12345678"
],
"detail": {
"instance-id": "i-0abcdef12345678",
"state": "pending"
}
}
Received event:
{
"version": "0",
"id": "<guid>",
"detail-type": "EC2 Instance State-change Notification",
"source": "aws.ec2",
"account": "12345678",
"time": "2019-06-20T19:01:37Z",
"region": "us-east-1",
"resources": [
"arn:aws:ec2:us-east-1:12345678:instance/i-0abcdef12345678"
],
"detail": {
"instance-id": "i-0abcdef12345678",
"state": "stopping"
}
}
Received event:
{
"version": "0",
"id": "<guid>",
"detail-type": "EC2 Instance State-change Notification",
"source": "aws.ec2",
"account": "12345678",
"time": "2019-06-20T19:01:37Z",
"region": "us-east-1",
"resources": [
"arn:aws:ec2:us-east-1:12345678:instance/i-0abcdef12345678"
],
"detail": {
"instance-id": "i-0abcdef12345678",
"state": "stopped"
}
}
According to the CloudWatch logs of the "worker" function (the one that actually tries to start the instance), we get:
StartInstancesRequest: {InstanceIds: [i-0abcdef12345678],}
StartInstancesResult: {StartingInstances: [{CurrentState: {Code: 0,Name: pending},InstanceId: i-0abcdef12345678,PreviousState: {Code: 80,Name: stopped}}]}
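So from the point of view of the Java-based Lambda doing the work, it does everything it needs to do: it issues the command to start the EC2 instance. But when the EC2 instance actually tries to start, it goes from "pending" to "stopping" to "stopped". If the function lacked permission, it wouldn't even get that far, right?
If the problem were with the instance itself (for example, hardware), I would expect a manual start from the AWS console to fail too. But it doesn't fail. The manual start succeeds!
What is going on? How can I diagnose this further? Is it permissions, or is the instance broken?
I am 99% sure this is not due to insufficient capacity in the AZ, because the instance always starts when I start it manually. This is not a transient problem, and it is not something that started recently. It has been going on for months: manual starts work 100% of the time, and script-based starts work 0% of the time.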
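The stopped -> pending -> stopping -> stopped pattern described in the question can be detected mechanically from the logged state-change events, which makes it easier to flag failed scheduled starts. Below is a minimal sketch, assuming the detail.state values for a single instance have already been extracted in time order. The class and method names are hypothetical, and no AWS SDK is involved:

```java
import java.util.Arrays;
import java.util.List;

final class AbortedStartDetector {

    // Returns true if "pending" is followed directly by "stopping" or
    // "stopped", meaning the instance never reached "running".
    static boolean isAbortedStart(List<String> statesInOrder) {
        for (int i = 0; i + 1 < statesInOrder.size(); i++) {
            String current = statesInOrder.get(i);
            String next = statesInOrder.get(i + 1);
            if ("pending".equals(current)
                    && ("stopping".equals(next) || "stopped".equals(next))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The state sequence reported in the question.
        System.out.println(isAbortedStart(Arrays.asList("pending", "stopping", "stopped")));
        // A normal start followed later by a normal stop.
        System.out.println(isAbortedStart(Arrays.asList("pending", "running", "stopping", "stopped")));
    }
}
```

Once an aborted start is detected, the instance's state reason and state transition reason, which DescribeInstances returns and the console shows in the instance details, are the natural next things to inspect.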
Try this policy and see whether it works. If it does, the problem is in your policy:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:PutLogEvents"
],
"Resource": "arn:aws:logs:*:*:*"
},
{
"Effect": "Allow",
"Action": [
"ec2:Start*",
"ec2:Stop*"
],
"Resource": "*"
}
]
}
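The encrypted EBS volumes may be the problem. As you mentioned, the EC2 instance has 3 EBS volumes encrypted with KMS. You must grant the KMS permission (kms:CreateGrant) for your instance to start: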
{
"Sid": "GrantAccess",
"Effect": "Allow",
"Action": "kms:CreateGrant",
"Resource": "arn:aws:kms:::key/1234"
}
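To check whether a given set of Action patterns covers kms:CreateGrant at all, the matching can be modeled in a few lines. The sketch below is a deliberately simplified model of IAM action matching: it treats * as a wildcard and compares case-insensitively, and it ignores Deny statements, resources, and conditions. The class and method names are hypothetical:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class ActionCoverage {

    // Converts an action pattern such as "ec2:Start*" to a regex:
    // every '*' becomes ".*" and every other character is matched literally.
    static Pattern toRegex(String actionPattern) {
        StringBuilder regex = new StringBuilder();
        for (char c : actionPattern.toCharArray()) {
            regex.append(c == '*' ? ".*" : Pattern.quote(String.valueOf(c)));
        }
        return Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE);
    }

    // Returns true if at least one allowed pattern matches the whole action name.
    static boolean isCovered(String action, List<String> allowedPatterns) {
        for (String pattern : allowedPatterns) {
            if (toRegex(pattern).matcher(action).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Action entries from the CloudFormation policy in the question.
        List<String> allowed = Arrays.asList(
                "logs:*",
                "ec2:DescribeInstanceAttribute",
                "ec2:DescribeInstanceStatus",
                "ec2:DescribeInstances",
                "ec2:StartInstances",
                "ec2:StopInstances",
                "ec2:DeleteTags");
        System.out.println("ec2:StartInstances covered: " + isCovered("ec2:StartInstances", allowed));
        System.out.println("kms:CreateGrant covered: " + isCovered("kms:CreateGrant", allowed));
    }
}
```

Under this model the policy from the question allows ec2:StartInstances but nothing in the kms namespace, which is consistent with the API call succeeding while the start itself is abandoned.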