AWS CloudWatch Alarm, Help Solving Error - Unchecked: Initial alarm creation

My scale-down CloudWatch alarm is stuck in the INSUFFICIENT_DATA state. The alarm is attached to my Auto Scaling group. It has been in this state for the past 3 days, so it has definitely finished initializing.

The reason given in my alarm is:

Unchecked: Initial alarm creation

This is from the AWS docs: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html

INSUFFICIENT_DATA—The alarm has just started, the metric is not available, or not enough data is available for the metric to determine the alarm state

Here is a snippet from my CloudFormation template that creates the CloudWatch alarm. It is written in troposphere syntax, but it should be easy to read:

template.add_resource(Alarm(
    "CPUUtilizationLowAlarm",
    ActionsEnabled=True,
    AlarmDescription="Scale down for average CPUUtilization <= 30%",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Statistic="Average",
    Period="300",
    EvaluationPeriods="3",
    Threshold="30",
    Unit="Percent",
    ComparisonOperator="LessThanOrEqualToThreshold",
    AlarmActions=[Ref("ScaleDownPolicy")],
    Dimensions=[
        MetricDimension(
            Name="AutoscalingGroupName",
            Value=Ref("AutoScalingGroup")
        )
    ]
))
template.add_resource(ScalingPolicy(
    "ScaleDownPolicy",                                                      #Simple reference value, nothing special
    AdjustmentType="ChangeInCapacity",                                      #Modify the asg capacity
    AutoScalingGroupName=Ref("AutoScalingGroup"),                           #What asg to modify capacity
    PolicyType="SimpleScaling",                                             #Read about it here: https://docs.aws.amazon.com/autoscaling/ec2/userguide/as-scale-based-on-demand.html
    Cooldown="1700",                                                        #How long to wait before checking status' again
    ScalingAdjustment=Ref("DownscalingCount")                               #(Must be a negative number!!) How much to scale down
))

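As you can see, I am scaling down on CPUUtilization <= 30%. As far as I can tell, this is a valid metric. I have read this Stack Overflow answer, but it does not seem to apply to my case: Amazon EC2 AutoScaling CPUUtilization Alarm- INSUFFICIENT DATA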

I do almost exactly the same thing with "Step Scaling" instead of "Simple Scaling", and that one actually works for me. Here is a snippet from my CloudFormation template for my step scaling alarm (scale up):

template.add_resource(Alarm(
    "CPUUtilizationHighAlarm",
    ActionsEnabled=True,
    AlarmDescription="Scale up for average CPUUtilization >= 50%",
    MetricName="CPUUtilization",
    Namespace="AWS/EC2",
    Statistic="Average",
    Period="300",
    EvaluationPeriods="1",
    Threshold="50",
    Unit="Percent",
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=[Ref("ScaleUpPolicy")],
    Dimensions=[
        MetricDimension(
            Name="AutoScalingGroupName",
            Value=Ref("AutoScalingGroup")
        )
    ]
))
template.add_resource(ScalingPolicy(
    "ScaleUpPolicy",
    AdjustmentType="ChangeInCapacity",
    AutoScalingGroupName=Ref("AutoScalingGroup"),                           #What group to attach this to
    EstimatedInstanceWarmup="1700",                                         #How long it will take before instance is ready for traffic
    MetricAggregationType="Average",                                        #Breach if average is above threshold
    PolicyType="StepScaling",                                               #Read about step scaling here: https://docs.aws.amazon.com/autoscaling/ec2/userguide/as-scale-based-on-demand.html
    StepAdjustments=[                                                       
        StepAdjustments(
            MetricIntervalLowerBound="0",                                   #From 50 (Defined in alarm as 50% CPU)
            MetricIntervalUpperBound="20",                                  #To 70%
            ScalingAdjustment="1"                                           #Scale up 1 instance
        ),
        StepAdjustments(
            MetricIntervalLowerBound="20",                                  #from 70%
            MetricIntervalUpperBound="40",                                  #to 90%
            ScalingAdjustment="2"                                           #Scale up 2 instances
        ),
        StepAdjustments(
            MetricIntervalLowerBound="40",                                  #From 90% or above (Defined in alarm)
            ScalingAdjustment="3"                                           #Scale up 3 instances
        )
    ]
))

I can't figure out what I have misconfigured in my scale-down alarm. Any suggestions or help would be great.

I found the problem...

The error is here:

MetricDimension(
        Name="AutoscalingGroupName",
        Value=Ref("AutoScalingGroup")
    )

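Name should be AutoScalingGroupName, not AutoscalingGroupName. With the misspelled name, CloudWatch treats it as a new dimension instead of matching the one published for the Auto Scaling group. So nothing throws an error and everything launches fine, there is just no data to pull. The alarm therefore stays in the INSUFFICIENT_DATA state until the end of time.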

Capital "S"...
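Since dimension names are case-sensitive and a wrong one fails silently, a quick check for names that differ only by case can catch this kind of typo. Below is a minimal sketch in plain Python. The function name `find_dimension_typos` and both input lists are hypothetical; the second list stands in for the dimension names the metric is actually published with.

```python
def find_dimension_typos(alarm_dimensions, published_dimensions):
    """Map each alarm dimension name that differs only by case
    from a published dimension name to the correctly cased name."""
    by_lower = {name.lower(): name for name in published_dimensions}
    typos = {}
    for name in alarm_dimensions:
        if name in published_dimensions:
            continue  # exact, case-sensitive match: nothing to report
        candidate = by_lower.get(name.lower())
        if candidate is not None:
            typos[name] = candidate
    return typos


# Hypothetical inputs: the name used in the alarm versus the name
# the metric is published with.
print(find_dimension_typos(["AutoscalingGroupName"], ["AutoScalingGroupName"]))
# prints {'AutoscalingGroupName': 'AutoScalingGroupName'}
```

It only flags case differences, so a name that is wrong in some other way still has to be compared against the published names by hand.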

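Thanks, I ran into exactly this problem when I used InstanceID instead of InstanceId in the dimensions field in Terraform. To troubleshoot, I created a manual alarm with settings similar to my Terraform arguments, then compared the JSON output of the working alarm against that of the non-working alarm, using an AWS CLI command such as the one below.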

aws cloudwatch describe-alarms --alarm-names <ALARM_NAME> --region=us-east-1
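The comparison itself can be scripted. This is a rough sketch, assuming the two `describe-alarms` outputs have been captured as JSON with the alarm under the `MetricAlarms` list. The helper name `diff_alarms`, the set of compared keys, and the sample JSON (trimmed to a few fields, with a made-up instance ID) are illustrative assumptions, not real output.

```python
import json

# Fields that decide which metric an alarm reads; an assumed,
# non-exhaustive selection.
METRIC_KEYS = ("Namespace", "MetricName", "Statistic", "Period", "Unit", "Dimensions")


def diff_alarms(working, broken, keys=METRIC_KEYS):
    """Return {key: (working_value, broken_value)} for keys that differ."""
    differences = {}
    for key in keys:
        if working.get(key) != broken.get(key):
            differences[key] = (working.get(key), broken.get(key))
    return differences


# Hypothetical, trimmed-down describe-alarms outputs.
working_json = """{"MetricAlarms": [{"AlarmName": "manual",
  "Namespace": "AWS/EC2", "MetricName": "CPUUtilization",
  "Dimensions": [{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}]}]}"""
broken_json = """{"MetricAlarms": [{"AlarmName": "terraform",
  "Namespace": "AWS/EC2", "MetricName": "CPUUtilization",
  "Dimensions": [{"Name": "InstanceID", "Value": "i-0123456789abcdef0"}]}]}"""

working = json.loads(working_json)["MetricAlarms"][0]
broken = json.loads(broken_json)["MetricAlarms"][0]

for key, (good, bad) in diff_alarms(working, broken).items():
    print(f"{key}: working={good} broken={bad}")
```

Fields that are expected to differ, such as the alarm name, are left out of the comparison on purpose so that only metric-related mismatches show up.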

Commands such as the following are also useful:

aws cloudwatch list-metrics --namespace AWS/ElasticMapReduce --metric-name CoreNodesRunning --query 'Metrics[0].Dimensions[].Name' --region=us-east-1

This helps determine which metric dimensions are available to use.