为什么从 WebServer01 到 DBServer 的 ping 超时?

Why is ping timing out from WebServer01 to DBServer?

我使用下面的 Terraform 代码创建了自定义 VPC。

我应用了下面的 Terraform 模块,其中:

  1. 添加 Public 和私有子网
  2. 配置 IGW
  3. 添加 NAT 网关
  4. 为 Web 和数据库服务器添加 SG

应用此后,我无法从 public ec2 实例 ping/ssh 到私有 ec2 实例。

不确定缺少什么。

# Custom VPC
resource "aws_vpc" "MyVPC" {
  cidr_block       = "10.0.0.0/16"
  instance_tenancy = "default" # For Prod use "dedicated"

  tags = {
    Name = "MyVPC"
  }
}

# Creates "Main Route Table", "NACL" & "default Security Group"

# Create Public Subnet, Associate with our VPC, Auto assign Public IP
resource "aws_subnet" "PublicSubNet" {
  vpc_id                  = aws_vpc.MyVPC.id # Our VPC
  availability_zone       = "eu-west-2a"     # AZ within London, 1 Subnet = 1 AZ
  cidr_block              = "10.0.1.0/24"    #  Check using this later > "${cidrsubnet(data.aws_vpc.MyVPC.cidr_block, 4, 1)}"
  map_public_ip_on_launch = "true"           # Auto assign Public IP for Public Subnet
  tags = {
    Name = "PublicSubNet"
  }
}

# Create Private Subnet, Associate with our VPC
resource "aws_subnet" "PrivateSubNet" {
  vpc_id            = aws_vpc.MyVPC.id # Our VPC
  availability_zone = "eu-west-2b"     # AZ within London region, 1 Subnet = 1 AZ
  cidr_block        = "10.0.2.0/24"    #  Check using this later > "${cidrsubnet(data.aws_vpc.MyVPC.cidr_block, 4, 1)}"
  tags = {
    Name = "PrivateSubNet"
  }
}

# Only 1 IGW per VPC
resource "aws_internet_gateway" "MyIGW" {
  vpc_id = aws_vpc.MyVPC.id
  tags = {
    Name = "MyIGW"
  }
}

# New Public route table, so we can keep "default main" route table as Private. Route out to MyIGW
resource "aws_route_table" "MyPublicRouteTable" {
  vpc_id = aws_vpc.MyVPC.id # Our VPC

  route {                    # Route out IPV4
    cidr_block = "0.0.0.0/0" # IPV4 Route Out for all
    # ipv6_cidr_block = "::/0"        The parameter destinationCidrBlock cannot be used with the parameter destinationIpv6CidrBlock # IPV6 Route Out for all
    gateway_id = aws_internet_gateway.MyIGW.id # Target : Internet Gateway created earlier
  }
  route {                                           # Route out IPV6
    ipv6_cidr_block = "::/0"                        # IPV6 Route Out for all
    gateway_id      = aws_internet_gateway.MyIGW.id # Target : Internet Gateway created earlier
  }
  tags = {
    Name = "MyPublicRouteTable"
  }
}

# Associate "PublicSubNet" with the public route table created above, removes it from default main route table
resource "aws_route_table_association" "PublicSubNetnPublicRouteTable" {
  subnet_id      = aws_subnet.PublicSubNet.id
  route_table_id = aws_route_table.MyPublicRouteTable.id
}

# Create new security group "WebDMZ" for WebServer
resource "aws_security_group" "WebDMZ" {
  name        = "WebDMZ"
  description = "Allows SSH & HTTP requests"
  vpc_id      = aws_vpc.MyVPC.id # Our VPC : SGs cannot span VPC

  ingress {
    description = "Allows SSH requests for VPC: IPV4"
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]   # SSH restricted to my laptop public IP <My PUBLIC IP>/32
  }
  ingress {
    description = "Allows HTTP requests for VPC: IPV4"
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]   # You can use Load Balancer
  }
  ingress {
    description      = "Allows HTTP requests for VPC: IPV6"
    from_port        = 80
    to_port          = 80
    protocol         = "tcp"
    ipv6_cidr_blocks = ["::/0"]
  }
  egress {
    description = "Allows SSH requests for VPC: IPV4"
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]   # SSH restricted to my laptop public IP <My PUBLIC IP>/32
  }
  egress {
    description = "Allows HTTP requests for VPC: IPV4"
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
  egress {
    description      = "Allows HTTP requests for VPC: IPV6"
    from_port        = 80
    to_port          = 80
    protocol         = "tcp"
    ipv6_cidr_blocks = ["::/0"]
  }
}

# Create new EC2 instance (WebServer01) in Public Subnet
# Get ami id from Console
resource "aws_instance" "WebServer01" {
  ami           = "ami-01a6e31ac994bbc09"
  instance_type = "t2.micro"
  subnet_id     = aws_subnet.PublicSubNet.id
  key_name = "MyEC2KeyPair"   # To connect using key pair
  security_groups = [aws_security_group.WebDMZ.id]    # Assign WebDMZ security group created above
  # vpc_security_group_ids = [aws_security_group.WebDMZ.id]
  tags = {
    Name = "WebServer01"
  }
}

# Create new security group "MyDBSG" for WebServer
resource "aws_security_group" "MyDBSG" {
  name        = "MyDBSG"
  description = "Allows Public WebServer to Communicate with Private DB Server"
  vpc_id      = aws_vpc.MyVPC.id # Our VPC : SGs cannot span VPC

  ingress {
    description = "Allows ICMP requests: IPV4" # For ping,communication, error reporting etc
    from_port   = -1
    to_port     = -1
    protocol    = "icmp"
    cidr_blocks = ["10.0.1.0/24"]    # Public Subnet CIDR block, can be "WebDMZ" security group id too as below
    security_groups = [aws_security_group.WebDMZ.id]        # Tried this as above was not working, but still doesn't work
  }
  ingress {
    description      = "Allows SSH requests: IPV4" # You can SSH from WebServer01 to DBServer, using DBServer private ip address and copying private key to WebServer
    from_port        = 22                          # ssh ec2-user@Private Ip Address -i MyPvKey.pem     Private Key pasted in MyPvKey.pem
    to_port          = 22                          # Not a good practise to use store private key on WebServer, instead use Bastion Host (Hardened Image, Secure) to connect to Private DB
    protocol         = "tcp"
    cidr_blocks = ["10.0.1.0/24"]
  }
  ingress {
    description = "Allows HTTP requests: IPV4"
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["10.0.1.0/24"]
  }
  ingress {
    description      = "Allows HTTPS requests : IPV4"
    from_port        = 443
    to_port          = 443
    protocol         = "tcp"
    cidr_blocks = ["10.0.1.0/24"]
  }
  ingress {
    description      = "Allows MySQL/Aurora requests"
    from_port        = 3306
    to_port          = 3306
    protocol         = "tcp"
    cidr_blocks = ["10.0.1.0/24"]
  }
  egress {
    description = "Allows ICMP requests: IPV4" # For ping,communication, error reporting etc
    from_port   = -1
    to_port     = -1
    protocol    = "icmp"
    cidr_blocks = ["10.0.1.0/24"] # Public Subnet CIDR block, can be "WebDMZ" security group id too
  }
  egress {
    description      = "Allows SSH requests: IPV4" # You can SSH from WebServer01 to DBServer, using DBServer private ip address and copying private key to WebServer
    from_port        = 22                          # ssh ec2-user@Private Ip Address -i MyPvtKey.pem     Private Key pasted in MyPvKey.pem chmod 400 MyPvtKey.pem
    to_port          = 22                          # Not a good practise to use store private key on WebServer, instead use Bastion Host (Hardened Image, Secure) to connect to Private DB
    protocol         = "tcp"
    cidr_blocks = ["10.0.1.0/24"]
  }
  egress {
    description = "Allows HTTP requests: IPV4"
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["10.0.1.0/24"]
  }
  egress {
    description      = "Allows HTTPS requests : IPV4"
    from_port        = 443
    to_port          = 443
    protocol         = "tcp"
    cidr_blocks = ["10.0.1.0/24"]
  }
  egress {
    description      = "Allows MySQL/Aurora requests"
    from_port        = 3306
    to_port          = 3306
    protocol         = "tcp"
    cidr_blocks = ["10.0.1.0/24"]
  }
}


# Create new EC2 instance (DBServer) in Private Subnet, Associate "MyDBSG" Security Group
resource "aws_instance" "DBServer" {
  ami           = "ami-01a6e31ac994bbc09"
  instance_type = "t2.micro"
  subnet_id     = aws_subnet.PrivateSubNet.id
  key_name = "MyEC2KeyPair"   # To connect using key pair
  security_groups = [aws_security_group.MyDBSG.id] # THIS WAS GIVING ERROR WHEN ASSOCIATING
  # vpc_security_group_ids = [aws_security_group.MyDBSG.id]
  tags = {
    Name = "DBServer"
  }
}

# Elastic IP required for NAT Gateway
resource "aws_eip" "nateip" {
  vpc = true
  tags = {
    Name = "NATEIP"
  }
}

# DBServer in private subnet cannot access internet, so add "NAT Gateway" in Public Subnet
# Add Target as NAT Gateway in default main route table. So there is route out to Internet.
# Now you can do yum update on DBServer

resource "aws_nat_gateway" "NATGW" {         # Create NAT Gateway in each AZ so in case of failure it can use other
  allocation_id = aws_eip.nateip.id          # Elastic IP allocation
  subnet_id     = aws_subnet.PublicSubNet.id # Public Subnet

  tags = {
    Name = "NATGW"
  }
}

# Main Route Table add NATGW as Target

resource "aws_default_route_table" "DefaultRouteTable" {
  default_route_table_id = aws_vpc.MyVPC.default_route_table_id

  route {
    cidr_block     = "0.0.0.0/0"              # IPV4 Route Out for all
    nat_gateway_id = aws_nat_gateway.NATGW.id # Target : NAT Gateway created above
  }

  tags = {
    Name = "DefaultRouteTable"
  }
}

为什么从 WebServer01 到 DBServer 的 ping 超时?

没有特定的 NACL,默认的 NACL 是完全开放的,因此它们在这里不相关。

为此,DBServer 上的安全组需要允许出口到 DBServer 的安全组或包含以下内容的 CIDR它。

aws_instance.DBServer 使用 aws_security_group.MyDBSG.

aws_instance.WebServer01 使用 aws_security_group.WebDMZ.

aws_security_group.WebDMZ的出口规则如下:

  egress {
    description = "Allows SSH requests for VPC: IPV4"
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]   # SSH restricted to my laptop public IP <My PUBLIC IP>/32
  }
  egress {
    description = "Allows HTTP requests for VPC: IPV4"
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
  egress {
    description      = "Allows HTTP requests for VPC: IPV6"
    from_port        = 80
    to_port          = 80
    protocol         = "tcp"
    ipv6_cidr_blocks = ["::/0"]
  }

这些出口规则意味着:

  1. 允许 SSH 到所有 IPv4 主机
  2. 允许所有 IPv4 和 IPv6 主机的 HTTP

ICMP 未列出,因此 ICMP 回显请求在离开 aws_security_group.WebDMZ 之前将被丢弃。对于 aws_instance.WebServer01.

的 ENI,这应该在 VPC FlowLog 中显示为 REJECT

将这个 egress 规则添加到 aws_security_group.WebDMZ 应该可以解决这个问题:

  egress {
    description      = "Allows ICMP requests: IPV4" # For ping,communication, error reporting etc
    from_port        = -1
    to_port          = -1
    protocol         = "icmp"
    cidr_blocks      = ["10.0.2.0/24"]
  }

DBServer 可能不会响应 ICMP,因此在进行此更改后您可能仍会看到超时。参考 VPC FlowLog 将有助于确定差异。如果您在 VPC FlowLog 中看到 ICMP 流的 ACCEPT,则问题是 DBServer 没有响应 ICMP。

aws_security_group.WebDMZ 中没有任何内容阻止 SSH,所以问题一定出在其他地方。

aws_security_group.MyDBSG上的入口规则如下。

  ingress {
    description = "Allows ICMP requests: IPV4" # For ping,communication, error reporting etc
    from_port   = -1
    to_port     = -1
    protocol    = "icmp"
    cidr_blocks = ["10.0.1.0/24"]    # Public Subnet CIDR block, can be "WebDMZ" security group id too as below
    security_groups = [aws_security_group.WebDMZ.id]        # Tried this as above was not working, but still doesn't work
  }
  ingress {
    description      = "Allows SSH requests: IPV4" # You can SSH from WebServer01 to DBServer, using DBServer private ip address and copying private key to WebServer
    from_port        = 22                          # ssh ec2-user@Private Ip Address -i MyPvKey.pem     Private Key pasted in MyPvKey.pem
    to_port          = 22                          # Not a good practise to use store private key on WebServer, instead use Bastion Host (Hardened Image, Secure) to connect to Private DB
    protocol         = "tcp"
    cidr_blocks = ["10.0.1.0/24"]
  }

这些出口规则意味着:

  1. 允许来自 Public 子网 (10.0.1.0/24)
  2. 的所有 ICMP
  3. 允许来自 Public 子网 (10.0.1.0/24) 的 SSH

SSH 应该可以正常工作。 DBServer 可能不接受 SSH 连接。

假设您的 DBServer 是 10.0.2.123,那么如果 SSH 不是 运行 它看起来像 WebServer01 运行 ssh -v 10.0.2.123:

debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 47: Applying options for *
debug1: Connecting to 10.0.2.123 [10.0.2.123] port 22.
debug1: connect to address 10.0.2.123 port 22: Operation timed out
ssh: connect to host 10.0.2.123 port 22: Operation timed out

参考 VPC FlowLog 将有助于确定差异。如果您在 VPC FlowLog 中看到 SSH 流的 ACCEPT,则问题是 DBServer 不接受 SSH 连接。

启用 VPC 流日志

由于这些年来已经发生了几次变化,我将向您指出 AWS' own maintained, step by step guide for creating a flow log that publishes to CloudWatch Logs

我目前推荐 CloudWatch 而不是 S3,因为与使用 Athena 设置和查询 S3 相比,使用 CloudWatch Logs Insights 进行查询非常容易。您只需选择用于流日志的 CloudWatch 日志流并搜索感兴趣的 IP 或端口。

此示例 CloudWatch Logs Insights 查询将获得 eni-0123456789abcdef0 上最近的 20 个拒绝(不是真正的 ENI,使用您正在调试的实际 ENI ID):

fields @timestamp,@message
| sort @timestamp desc
| filter @message like 'eni-0123456789abcdef0'
| filter @message like 'REJECT'
| limit 20

在 VPC 流日志中,丢失的出口规则在源 ENI 上显示为拒绝。

在 VPC 流日志中,丢失的入口规则在目标 ENI 上显示为 REJECT。

安全组是有状态的

无状态数据包过滤器要求您处理神秘的事情,例如大多数(不是所有)操作系统使用端口 32767-65535 来处理 TCP 流中的回复流量。这是一个痛苦,它使 NACL(无状态)成为一个巨大的痛苦。

像安全组这样的有状态防火墙会自动跟踪连接(有状态的状态),因此您只需要在源 SG 的出口规则中允许服务端口到目标 SG 或 IP CIDR 块以及来自源 SG 或目标 SG 的入口规则中的 IP CIDR 块。

尽管安全组 (SG) 是有状态的,但它们默认为拒绝。这包括来自源的出站发起流量。如果源 SG 不允许流量出站到目的地,则即使目的地有允许它的 SG,也是不允许的。这是一个普遍的误解。 SG规则是不可传递的,需要双方制定。

AWS's Protecting Your Instance with Security Groups video 很好地直观地解释了这一点。

旁白:风格推荐

您应该使用 aws_security_group_rule 资源而不是内联的出站和入站规则。

NOTE on Security Groups and Security Group Rules: Terraform currently provides both a standalone Security Group Rule resource (a single ingress or egress rule), and a Security Group resource with ingress and egress rules defined in-line. At this time you cannot use a Security Group with in-line rules in conjunction with any Security Group Rule resources. Doing so will cause a conflict of rule settings and will overwrite rules.

使用带有内联规则的旧式 aws_security_group

resource "aws_security_group" "WebDMZ" {
  name        = "WebDMZ"
  description = "Allows SSH & HTTP requests"
  vpc_id      = aws_vpc.MyVPC.id

  ingress {
    description = "Allows HTTP requests for VPC: IPV4"
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]   # You can use Load Balancer
  }

  egress {
    description = "Allows SSH requests for VPC: IPV4"
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

并将其替换为这种现代风格的 aws_security_group,每个规则都有 aws_security_group_rule 资源:

resource "aws_security_group" "WebDMZ" {
  name        = "WebDMZ"
  description = "Allows SSH & HTTP requests"
  vpc_id      = aws_vpc.MyVPC.id
}

resource "aws_security_group_rule" "WebDMZ_HTTP_in" {
  security_group_id = aws_security_group.WebDMZ.id

  type        = "ingress"
  description = "Allows HTTP requests for VPC: IPV4"
  from_port   = 80
  to_port     = 80
  protocol    = "tcp"
  cidr_blocks = ["0.0.0.0/0"]
}

resource "aws_security_group_rule" "WebDMZ_SSH_out" {
  security_group_id = aws_security_group.WebDMZ.id

  type        = "egress"
  description = "Allows SSH requests for VPC: IPV4"
  from_port   = 22
  to_port     = 22
  protocol    = "tcp"
  cidr_blocks = ["0.0.0.0/0"]
}

允许下面粘贴的 "WebDMZ" 安全组的所有出站规则解决了问题。

egress {  # Allow allow traffic outbound, THIS WAS THE REASON YOU WAS NOT ABLE TO PING FROM WebServer to DBServer
    description = "Allows All Traffic Outbound from Web Server" 
    from_port   = 0
    to_port     = 0
    protocol    = -1
    cidr_blocks = ["0.0.0.0/0"] 
  }