How to gracefully avoid divide by zero in Prometheus

Question

Sometimes you need to divide one metric by another.

For example, I'd like to calculate the mean latency like this:

rate({__name__="hystrix_command_latency_total_seconds_sum"}[60s])
/
rate({__name__="hystrix_command_latency_total_seconds_count"}[60s])

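If there is no activity during the specified time period, the rate() in the divider becomes 0 and the result of division becomes NaN. If I then do some aggregation over the results (avg(), sum(), or anything else), the whole aggregation result becomes NaN.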
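To make the NaN propagation concrete, here is a minimal Python sketch of the arithmetic only. It is not Prometheus code: the helper promql_div and the sample rates are made up for illustration, and the helper emulates IEEE 754 division (which PromQL follows) because plain Python raises an exception on division by zero.

```python
import math

def promql_div(numerator: float, denominator: float) -> float:
    """Emulate IEEE 754 float division; signed zeros are ignored for brevity."""
    if denominator == 0.0:
        if numerator == 0.0 or math.isnan(numerator):
            return math.nan  # 0/0 is NaN
        return math.copysign(math.inf, numerator)  # x/0 is +Inf or -Inf
    return numerator / denominator

# Hypothetical per-series values: (rate of _sum, rate of _count).
# The second series had no activity in the window.
series = [(0.30, 2.0), (0.0, 0.0), (0.12, 1.0)]

latencies = [promql_div(s, c) for s, c in series]
average = sum(latencies) / len(latencies)

print(latencies)  # the idle series yields nan
print(average)    # one nan is enough to make the average nan
```

A single idle series is enough to turn the aggregate into NaN, which is the effect described above.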

So I added a check for zero in the divider:

rate({__name__="hystrix_command_latency_total_seconds_sum"}[60s])
/
(rate({__name__="hystrix_command_latency_total_seconds_count"}[60s]) > 0)

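This removes the NaNs from the result vector. It also tears the line on the graph into pieces.

Let's mark the periods of inactivity with a 0 value to make the graph continuous again: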

rate({__name__="hystrix_command_latency_total_seconds_sum"}[60s])
/
(rate({__name__="hystrix_command_latency_total_seconds_count"}[60s]) > 0)
or
rate({__name__="hystrix_command_latency_total_seconds_count"}[60s]) > bool 0

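This effectively replaces NaNs with 0, the graph is continuous, and aggregations work OK.

But the resulting query is rather cumbersome, especially when you need more label filtering and some aggregation on the results. Something like this: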

avg(
    1000 * increase({__name__=~".*_hystrix_command_latency_total_seconds_sum", command_group=~"$commandGroup", command_name=~"$commandName", job=~"$service", instance=~"$instance"}[60s])
    /
    (increase({__name__=~".*_hystrix_command_latency_total_seconds_count", command_group=~"$commandGroup", command_name=~"$commandName", job=~"$service", instance=~"$instance"}[60s]) > 0)
    or
    increase({__name__=~".*_hystrix_command_latency_total_seconds_count", command_group=~"$commandGroup", command_name=~"$commandName", job=~"$service", instance=~"$instance"}[60s]) > bool 0
) by (command_group, command_name)

Long story short: is there a simpler way to deal with zeros in the divider? Or is there a common practice?

Answer 1

If there is no activity during the specified time period, the rate() in the divider becomes 0 and the result of division becomes NaN.

This is the correct behaviour; NaN is the result you want.

aggregations work OK.

You can't aggregate ratios. You need to sum the numerator and the denominator separately and then divide.

So:

   sum by (command_group, command_name)(rate(hystrix_command_latency_total_seconds_sum[5m]))
  /
   sum by (command_group, command_name)(rate(hystrix_command_latency_total_seconds_count[5m]))
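As an illustration of why the ratio of sums differs from an aggregate of per-series ratios, here is a small Python sketch of the arithmetic. The instance names and rates are invented; only the calculation mirrors the query above.

```python
import math

# Hypothetical per-instance rates for one command:
# (latency seconds per second, requests per second).
instances = {
    "a": (9.0, 90.0),  # busy instance, 0.1 s per request
    "b": (1.0, 1.0),   # nearly idle instance, 1.0 s per request
    "c": (0.0, 0.0),   # idle instance, its own ratio would be 0/0
}

# Ratio of sums: what sum(rate(..._sum)) / sum(rate(..._count)) computes.
total_seconds = sum(s for s, _ in instances.values())
total_requests = sum(c for _, c in instances.values())
ratio_of_sums = total_seconds / total_requests if total_requests else math.nan

# Mean of ratios: averages per-instance latencies and ignores traffic volume.
per_instance = [s / c for s, c in instances.values() if c > 0]
mean_of_ratios = sum(per_instance) / len(per_instance)

print(ratio_of_sums)   # about 0.11, weighted by request volume
print(mean_of_ratios)  # about 0.55, dominated by the nearly idle instance
```

The idle instance contributes 0 to both sums, so it does not disturb the ratio of sums. The result is NaN only when every series in the group is idle, which is the case where no meaningful latency exists anyway.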

Answer 2

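I finally have a solution for my specific problem:

Dividing by zero leads to NaN being displayed. That is fine and correct as a technical result, but it is not what the user wants to see (it does not meet the business requirement).

So I searched a bit and found a "solution" to my problem in the Grafana community: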

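Wrap your problematic value in max(YOUR_PROBLEMATIC_QUERY or vector(-1)). An additional value mapping in Grafana then leads to useful output.

(Of course you have to adapt the solution to your problem: min/max, vector(42)/vector(101)/vector(...).)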

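Update (1)

OK. However, depending on the query, this turns out to be a bit tricky. For example, I had another query that failed by returning NaN because of a division by zero, and the solution above did not work. I had to wrap the query in parentheses and append > 0 or on() vector(100).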

Answer 3

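Building on @eventhorizen's answer: if the denominator is a query that can sometimes return zero, it can mess up the graph and show infinity where there is no data. In that case you can limit the result to the valid range.

For example, the output of this metric should be between 0 and 1, but when there is no data it also produces INFINITY: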

(1/increase(SOMETIMES_ZERO_QUERY[1m]))

In this case you can write it like this instead, so it shows 0 instead of values greater than 100:

max((1/increase(SOMETIMES_ZERO_QUERY[1m]))<100 or on() vector(0))

Or, if you want 1 in place of INFINITY:

max((1/increase(SOMETIMES_ZERO_QUERY[1m]))<100 or on() vector(1))

Answer 4

I faced the same issue and implemented the solution this way:

increase({metric query}[2m]) / (increase({problematic zero giving metric query}[2m]))!=0 or on() vector(1) > 150

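I had to check whether the denominator gives 0, which in turn would give infinity and make the graph absurd. To avoid this, I added the condition !=0 or on() vector(1), so that whenever the denominator becomes 0, its value is returned as 1.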

Answer 5

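Just add > smallest_value to the query and then wrap it in an aggregate function such as avg(), where smallest_value is a value smaller than any expected valid result of the inner query. For example: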

avg((
  rate({__name__="hystrix_command_latency_total_seconds_sum"}[60s])
  /
  rate({__name__="hystrix_command_latency_total_seconds_count"}[60s])
) > -1e12)

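Prometheus drops NaN values when they are compared with any number using the > operator, because the comparison is false: for example, NaN > bool -1e12 returns 0. The same applies to the < operator, e.g. NaN < bool 1e12. So either > or < can be used to filter out NaN values before passing them to aggregate functions.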
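The filtering effect can be sketched in Python, where comparisons with NaN are also always false. This only models the semantics; filter_greater and the sample values are hypothetical and not part of any Prometheus API.

```python
import math

def filter_greater(samples: dict[str, float], threshold: float) -> dict[str, float]:
    """Keep samples for which `value > threshold` is true, like a PromQL
    comparison without the bool modifier. NaN > x is false, so NaN is dropped."""
    return {labels: value for labels, value in samples.items() if value > threshold}

# Hypothetical per-command latency ratios; cmd_b was idle (0/0).
ratios = {"cmd_a": 0.15, "cmd_b": math.nan, "cmd_c": 0.12}

kept = filter_greater(ratios, -1e12)
average = sum(kept.values()) / len(kept)

print(sorted(kept))  # cmd_b is gone
print(average)       # a finite number, the mean of the two remaining samples
```

If every sample is NaN, nothing survives the filter. This sketch would then fail on the division, whereas the PromQL query would simply return an empty result.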


P.S. This trick isn't needed in MetricsQL, since VictoriaMetrics automatically skips NaN values when applying aggregate functions to them.

prometheus