Why won't my AWS ECS service start my task?
I'm having trouble with a new AWS load balancer and an AWS ECS repository, cluster, and task that I created in AWS with Terraform. Everything is created without errors. There are some IAM roles and certificates in separate files; the relevant definitions are below. What happens is that the ECS service creates a task, but the task shuts down immediately after starting. I see no logs at all in the CloudWatch log group; in fact, the log group is never even created.
When I first ran the infrastructure, it made sense to me that the whole thing failed, because the ECR repository was brand new and no Docker image had been pushed to it yet. But I have since pushed an image, and the service has never started. I assumed it would keep retrying the task in a loop after each failure, but it does not.
I have forced it to restart by destroying the service and recreating it. Given that an image now exists, I expected it to work. It shows the same behavior as on the initial launch: the service creates a task that fails to start, there are no logs giving a reason, and then it never runs a task again.
Does anyone know what is wrong here, or where I might look for the error?
locals {
  container_name = "tdweb-web-server-container"
}
resource "aws_lb" "web_server" {
  name               = "tdweb-alb"
  internal           = false
  load_balancer_type = "application"
  security_groups    = [aws_security_group.lb_sg.id]
  subnets = [
    aws_subnet.subnet_a.id,
    aws_subnet.subnet_b.id,
    aws_subnet.subnet_c.id
  ]
}
resource "aws_security_group" "lb_sg" {
  name        = "ALB Security Group"
  description = "Allows TLS inbound traffic"
  vpc_id      = aws_vpc.main.id

  ingress {
    description = "TLS from VPC"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
resource "aws_security_group" "web_server_service" {
  name        = "Web Server Service Security Group"
  description = "Allows HTTP inbound traffic"
  vpc_id      = aws_vpc.main.id

  ingress {
    description = "HTTP from VPC"
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
resource "aws_alb_listener" "https" {
  load_balancer_arn = aws_lb.web_server.arn
  port              = 443
  protocol          = "HTTPS"
  ssl_policy        = "ELBSecurityPolicy-2016-08"
  certificate_arn   = aws_acm_certificate.main.arn

  default_action {
    target_group_arn = aws_lb_target_group.web_server.arn
    type             = "forward"
  }
}
resource "random_string" "target_group_suffix" {
  length  = 4
  upper   = false
  special = false
}

resource "aws_lb_target_group" "web_server" {
  name        = "web-server-target-group-${random_string.target_group_suffix.result}"
  port        = 80
  protocol    = "HTTP"
  target_type = "ip"
  vpc_id      = aws_vpc.main.id

  lifecycle {
    create_before_destroy = true
  }
}
resource "aws_iam_role" "web_server_task" {
  name               = "tdweb-web-server-task-role"
  assume_role_policy = data.aws_iam_policy_document.web_server_task.json
}

data "aws_iam_policy_document" "web_server_task" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["ecs-tasks.amazonaws.com"]
    }
  }
}

resource "aws_iam_role_policy_attachment" "web_server_task" {
  for_each = toset([
    "arn:aws:iam::aws:policy/AmazonSQSFullAccess",
    "arn:aws:iam::aws:policy/AmazonS3FullAccess",
    "arn:aws:iam::aws:policy/AmazonDynamoDBFullAccess",
    "arn:aws:iam::aws:policy/AWSLambdaInvocation-DynamoDB"
  ])

  role       = aws_iam_role.web_server_task.name
  policy_arn = each.value
}
resource "aws_ecr_repository" "web_server" {
  name = "tdweb-web-server-repository"
}

resource "aws_ecs_cluster" "web_server" {
  name = "tdweb-web-server-cluster"
}
resource "aws_ecs_task_definition" "web_server" {
  family                   = "task_definition_name"
  task_role_arn            = aws_iam_role.web_server_task.arn
  execution_role_arn       = aws_iam_role.ecs_task_execution.arn
  network_mode             = "awsvpc"
  cpu                      = "1024"
  memory                   = "2048"
  requires_compatibilities = ["FARGATE"]

  container_definitions = <<DEFINITION
[
  {
    "name": "${local.container_name}",
    "image": "${aws_ecr_repository.web_server.repository_url}:latest",
    "logConfiguration": {
      "logDriver": "awslogs",
      "options": {
        "awslogs-group": "/ecs/tdweb-task",
        "awslogs-region": "us-east-1",
        "awslogs-stream-prefix": "ecs"
      }
    },
    "portMappings": [
      {
        "hostPort": 80,
        "protocol": "tcp",
        "containerPort": 80
      }
    ],
    "cpu": 0,
    "essential": true
  }
]
DEFINITION
}
resource "aws_ecs_service" "web_server" {
  name            = "tdweb-web-server-service"
  cluster         = aws_ecs_cluster.web_server.id
  launch_type     = "FARGATE"
  task_definition = aws_ecs_task_definition.web_server.arn
  desired_count   = 1

  load_balancer {
    target_group_arn = aws_lb_target_group.web_server.arn
    container_name   = local.container_name
    container_port   = 80
  }

  network_configuration {
    subnets = [
      aws_subnet.subnet_a.id,
      aws_subnet.subnet_b.id,
      aws_subnet.subnet_c.id
    ]
    assign_public_ip = true
    security_groups  = [aws_security_group.web_server_service.id]
  }
}
Edit: to answer a comment, here are the VPC and subnets:
resource "aws_vpc" "main" {
  cidr_block = "172.31.0.0/16"
}

resource "aws_subnet" "subnet_a" {
  vpc_id            = aws_vpc.main.id
  availability_zone = "us-east-1a"
  cidr_block        = "172.31.0.0/20"
}

resource "aws_subnet" "subnet_b" {
  vpc_id            = aws_vpc.main.id
  availability_zone = "us-east-1b"
  cidr_block        = "172.31.16.0/20"
}

resource "aws_subnet" "subnet_c" {
  vpc_id            = aws_vpc.main.id
  availability_zone = "us-east-1c"
  cidr_block        = "172.31.32.0/20"
}

resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id
}
Edit: here is a somewhat enlightening update. I found that the error is reported not in the task logs but in the details of the container inside the task; I never knew to look there.
Status reason CannotPullContainerError: Error response from daemon:
Get https://563407091361.dkr.ecr.us-east-1.amazonaws.com/v2/:
net/http: request canceled while waiting for connection
(Client.Timeout exceeded while awaiting headers)
It seems the service cannot pull the container image from the ECR repository. After some reading, I still don't know how to fix this; I'm still looking around.
Per the comments, one likely problem is that the subnets have no internet access. This can be corrected as follows:
# Route table to connect to the Internet Gateway
resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }
}

resource "aws_route_table_association" "subnet_public_a" {
  subnet_id      = aws_subnet.subnet_a.id
  route_table_id = aws_route_table.public.id
}

resource "aws_route_table_association" "subnet_public_b" {
  subnet_id      = aws_subnet.subnet_b.id
  route_table_id = aws_route_table.public.id
}

resource "aws_route_table_association" "subnet_public_c" {
  subnet_id      = aws_subnet.subnet_c.id
  route_table_id = aws_route_table.public.id
}
You can also add a depends_on to your aws_ecs_service so that it waits for these associations to be created.
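As a sketch (reusing the aws_ecs_service and association names above), the explicit dependency could look like this:

```hcl
# Sketch only: the same aws_ecs_service resource shown earlier, with an
# explicit dependency so the service is not created before the route
# table associations (and therefore internet access for ECR image
# pulls) exist.
resource "aws_ecs_service" "web_server" {
  # ... all the arguments shown earlier ...

  depends_on = [
    aws_route_table_association.subnet_public_a,
    aws_route_table_association.subnet_public_b,
    aws_route_table_association.subnet_public_c,
  ]
}
```

Without this, Terraform may create the service as soon as the subnets exist, and the first task can fail with the same CannotPullContainerError before the routes are in place.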
A shorter alternative for the associations:
locals {
  subnets = [
    aws_subnet.subnet_a.id,
    aws_subnet.subnet_b.id,
    aws_subnet.subnet_c.id
  ]
}

resource "aws_route_table_association" "subnet_public" {
  count          = length(local.subnets)
  subnet_id      = local.subnets[count.index]
  route_table_id = aws_route_table.public.id
}