How do I query an API latency error budget from Prometheus?
I have a Prometheus histogram, api_response_duration_seconds, and an SLO defined as
histogram_quantile(0.95, sum(increase(api_response_duration_seconds_bucket[1m])) by (le)) <= 0.5
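Is there a simple way to query what percentage of the time this query has failed over the past 28 days? That is, I want to be able to answer "Has this query failed for more than 0.1% of the time for the past 28 days?".
So the secret here is that I want to turn the result of an expression (an instant vector) into a range vector. This isn't possible with a plain range selector in Prometheus, because a range selector such as [28d] only applies to a stored series and not to an arbitrary expression, but the workaround is to use a recording rule.
So, what needs to be done is: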
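groups:
  - name: SLOs
    rules:
      # Has a sample only while the 95th percentile latency exceeds 0.5s.
      - record: slo:api_response_duration_seconds:failing
        expr: histogram_quantile(0.95, sum(increase(api_response_duration_seconds_bucket[1m])) by (le)) > 0.5
      # Has a sample at every rule evaluation, whether or not the SLO is met.
      - record: slo:api_response_duration_seconds:all
        expr: histogram_quantile(0.95, sum(increase(api_response_duration_seconds_bucket[1m])) by (le))
Then query the error budget, as the fraction of rule evaluations over the past 28 days that violated the SLO, with: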
count_over_time(slo:api_response_duration_seconds:failing[28d])
/
count_over_time(slo:api_response_duration_seconds:all[28d])
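To answer the original yes/no question directly, the ratio can be compared against the 0.1% threshold. For scale, 28 days is 40,320 minutes, so 0.1% of the window is about 40 minutes of failing evaluations. The following is a minimal sketch rather than a tested rule. It assumes the two recorded series carry no extra labels, so that `or vector(0)` can stand in when the SLO never failed in the window and the `failing` series has no samples at all.

```promql
# Returns 1 if the SLO was violated for more than 0.1% of evaluations
# in the past 28 days, and 0 otherwise.
(
  (
    count_over_time(slo:api_response_duration_seconds:failing[28d])
    # Assumption: no extra labels on the recorded series, so vector(0)
    # fills in when there were no failing samples in the window.
    or vector(0)
  )
  /
  count_over_time(slo:api_response_duration_seconds:all[28d])
) > bool 0.001
```

Dropping the `bool` modifier makes the query return the ratio itself only when the budget is exceeded, which is the form an alerting rule would typically want.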