AWS CloudWatch Alarm, Help Solving Error - Unchecked: Initial alarm creation
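My scale-down CloudWatch alarm is stuck in the INSUFFICIENT_DATA state. The alarm is attached to my Auto Scaling group. It has been in this state for the past 3 days, so it has definitely finished initializing.
The reason given on the alarm is:
Unchecked: Initial alarm creation
This is from the AWS docs: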
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html
INSUFFICIENT_DATA—The alarm has just started, the metric is not available, or not enough data is available for the metric to determine the alarm state
Here is the snippet from my CloudFormation template that creates the CloudWatch alarm. It is written in troposphere syntax, but it should be easy to read:
template.add_resource(Alarm(
    "CPUUtilizationLowAlarm",
    ActionsEnabled=True,
    AlarmDescription="Scale down for average CPUUtilization <= 30%",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Statistic="Average",
    Period="300",
    EvaluationPeriods="3",
    Threshold="30",
    Unit="Percent",
    ComparisonOperator="LessThanOrEqualToThreshold",
    AlarmActions=[Ref("ScaleDownPolicy")],
    Dimensions=[
        MetricDimension(
            Name="AutoscalingGroupName",
            Value=Ref("AutoScalingGroup")
        )
    ]
))
template.add_resource(ScalingPolicy(
    "ScaleDownPolicy",  # Simple reference value, nothing special
    AdjustmentType="ChangeInCapacity",  # Modify the asg capacity
    AutoScalingGroupName=Ref("AutoScalingGroup"),  # What asg to modify capacity
    PolicyType="SimpleScaling",  # Read about it here: https://docs.aws.amazon.com/autoscaling/ec2/userguide/as-scale-based-on-demand.html
    Cooldown="1700",  # How long to wait before checking status' again
    ScalingAdjustment=Ref("DownscalingCount")  # (Must be a negative number!!) How much to scale down
))
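As you can see, I am scaling down on CPUUtilization <= 30%, which as far as I can tell is a valid metric. I have read this Stack Overflow answer, but it does not seem to apply to this case:
Amazon EC2 AutoScaling CPUUtilization Alarm- INSUFFICIENT DATA
I do almost exactly the same thing with "Step Scaling" instead of the "Simple Scaling" used above, and that one actually works for me. Here is the snippet from my CloudFormation template for my step scaling alarm (scale up):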
template.add_resource(Alarm(
    "CPUUtilizationHighAlarm",
    ActionsEnabled=True,
    AlarmDescription="Scale up for average CPUUtilization >= 50%",
    MetricName="CPUUtilization",
    Namespace="AWS/EC2",
    Statistic="Average",
    Period="300",
    EvaluationPeriods="1",
    Threshold="50",
    Unit="Percent",
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=[Ref("ScaleUpPolicy")],
    Dimensions=[
        MetricDimension(
            Name="AutoScalingGroupName",
            Value=Ref("AutoScalingGroup")
        )
    ]
))
template.add_resource(ScalingPolicy(
    "ScaleUpPolicy",
    AdjustmentType="ChangeInCapacity",
    AutoScalingGroupName=Ref("AutoScalingGroup"),  # What group to attach this to
    EstimatedInstanceWarmup="1700",  # How long it will take before instance is ready for traffic
    MetricAggregationType="Average",  # Breach if average is above threshold
    PolicyType="StepScaling",  # Read about step scaling here: https://docs.aws.amazon.com/autoscaling/ec2/userguide/as-scale-based-on-demand.html
    StepAdjustments=[
        StepAdjustments(
            MetricIntervalLowerBound="0",  # From 50 (Defined in alarm as 50% CPU)
            MetricIntervalUpperBound="20",  # To 70%
            ScalingAdjustment="1"  # Scale up 1 instance
        ),
        StepAdjustments(
            MetricIntervalLowerBound="20",  # From 70%
            MetricIntervalUpperBound="40",  # To 90%
            ScalingAdjustment="2"  # Scale up 2 instances
        ),
        StepAdjustments(
            MetricIntervalLowerBound="40",  # From 90% or above
            ScalingAdjustment="3"  # Scale up 3 instances
        )
    ]
))
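For readers following the comments in the step policy: the interval bounds are offsets from the alarm threshold, not absolute CPU values, which is how 0/20/40 become 50%, 70% and 90%. A minimal sketch of that arithmetic in plain Python, assuming a scale-up alarm where the lower bound is inclusive and the upper bound is exclusive. `pick_adjustment` and the tuple layout are hypothetical names for illustration, not part of troposphere or AWS:

```python
from typing import Optional, Sequence, Tuple

# (lower_offset, upper_offset, scaling_adjustment); None means unbounded.
# Values mirror the StepAdjustments in the policy above.
Step = Tuple[Optional[float], Optional[float], int]

STEPS: Sequence[Step] = (
    (0, 20, 1),
    (20, 40, 2),
    (40, None, 3),
)


def pick_adjustment(metric_value: float, threshold: float,
                    steps: Sequence[Step] = STEPS) -> Optional[int]:
    """Return the adjustment whose interval contains the breach size."""
    breach = metric_value - threshold  # offset from the alarm threshold
    for lower, upper, adjustment in steps:
        above_lower = lower is None or breach >= lower
        below_upper = upper is None or breach < upper
        if above_lower and below_upper:
            return adjustment
    return None  # metric is below the threshold, so no step applies


if __name__ == "__main__":
    for cpu in (55, 70, 95):
        print(cpu, pick_adjustment(cpu, threshold=50))
```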
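I have no idea what I misconfigured in the scale-down alarm. Any suggestions or help would be appreciated.
I found the problem...
The error is here:
MetricDimension(
    Name="AutoscalingGroupName",
    Value=Ref("AutoScalingGroup")
)
Name should be AutoScalingGroupName, not AutoscalingGroupName. With the wrong spelling, CloudWatch treats it as a new, different dimension instead of pulling data from the Auto Scaling group. No error is thrown and everything launches fine; the alarm simply has no data to pull, so it stays in the INSUFFICIENT_DATA state until the end of time.
Capital "S"...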
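Because dimension names are case-sensitive and a wrong one fails silently, it can help to check the names used in a template against the names the metric actually publishes. A minimal sketch in plain Python: `find_case_mismatches` is a hypothetical helper, and the list of known names is assumed to have been collected separately, for example from `aws cloudwatch list-metrics`:

```python
from typing import Dict, Iterable


def find_case_mismatches(alarm_dims: Iterable[str],
                         known_dims: Iterable[str]) -> Dict[str, str]:
    """Map each alarm dimension name that differs from a known
    name only by letter case to the correctly spelled name."""
    by_lower = {name.lower(): name for name in known_dims}
    mismatches = {}
    for name in alarm_dims:
        expected = by_lower.get(name.lower())
        if expected is not None and expected != name:
            mismatches[name] = expected
    return mismatches


if __name__ == "__main__":
    # Sample list of AWS/EC2 dimension names, typed in by hand for illustration.
    known = ["AutoScalingGroupName", "ImageId", "InstanceId", "InstanceType"]
    print(find_case_mismatches(["AutoscalingGroupName"], known))
```

A name that matches nothing even when case is ignored is not reported by this sketch; it only targets the capitalization slip described above.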
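Thanks, I ran into exactly this problem when I used InstanceID instead of InstanceId in the dimensions field in Terraform. To troubleshoot, I created a manual alarm with settings similar to my Terraform parameters, then compared the JSON output of the working alarm with that of the non-working alarm using AWS CLI commands such as the one below: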
aws cloudwatch describe-alarms --alarm-names <ALARM_NAME> --region=us-east-1
Commands such as the following also help determine which metric dimensions are available:
aws cloudwatch list-metrics --namespace AWS/ElasticMapReduce --metric-name CoreNodesRunning --query 'Metrics[0].Dimensions[].Name' --region=us-east-1
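The comparison of a working and a non-working alarm can also be scripted. A minimal sketch in plain Python, assuming each input is the parsed JSON printed by `describe-alarms` (a top-level `MetricAlarms` list). `diff_alarms`, `first_alarm` and the set of ignored fields are hypothetical choices for illustration, and the sample alarms are made up:

```python
import json
from typing import Any, Dict, Tuple

# Fields expected to differ between any two alarms (an assumption; adjust as needed).
IGNORED = frozenset({
    "AlarmName", "AlarmArn", "StateValue", "StateReason",
    "StateReasonData", "StateUpdatedTimestamp",
    "AlarmConfigurationUpdatedTimestamp",
})


def first_alarm(describe_output: Dict[str, Any]) -> Dict[str, Any]:
    """Pull the first alarm out of a describe-alarms style document."""
    return describe_output["MetricAlarms"][0]


def diff_alarms(working: Dict[str, Any], broken: Dict[str, Any],
                ignored=IGNORED) -> Dict[str, Tuple[Any, Any]]:
    """Return {field: (working_value, broken_value)} for fields that differ."""
    keys = (set(working) | set(broken)) - set(ignored)
    return {
        key: (working.get(key), broken.get(key))
        for key in sorted(keys)
        if working.get(key) != broken.get(key)
    }


if __name__ == "__main__":
    working = json.loads('{"MetricAlarms": [{"AlarmName": "manual", '
                         '"Namespace": "AWS/EC2", "Dimensions": '
                         '[{"Name": "InstanceId", "Value": "i-example"}]}]}')
    broken = json.loads('{"MetricAlarms": [{"AlarmName": "terraform", '
                        '"Namespace": "AWS/EC2", "Dimensions": '
                        '[{"Name": "InstanceID", "Value": "i-example"}]}]}')
    for field, (good, bad) in diff_alarms(first_alarm(working),
                                          first_alarm(broken)).items():
        print(field, good, bad)
```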