Prometheus increase not handling process restarts

Question

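I'm trying to figure out how the Prometheus increase() query function behaves across process restarts.

When a process restart happens within a 2m interval and I query: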

sum(increase(my_metric_total[2m]))

I get a lower value than expected.

For example, in a simple experiment I simulated:

3 lcm_restarts
1 process restart
2 lcm_restarts

all within a 2-minute interval.

After querying:

sum(increase(lcm_restarts[2m]))

I received a value of ~4.5 when I expected 5.

[Image: lcm_restarts graph]

[Image: sum(increase(lcm_restarts[2m])) result]

Could someone please explain?

Answer 1

A very concise and well-prepared first question here. Please keep up that spirit!

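When working with counters, the functions rate(), irate() and increase() adjust for resets caused by restarts. Contrary to what the name suggests, increase() does not calculate the absolute increase over the given time frame; it is another way of writing rate(metric[interval]) * number_of_seconds_in_interval. The rate() function takes the first and last measurements of the series within the interval and calculates the per-second increase over the given time. That is why you may observe non-integer increases even though the counter only ever goes up in whole numbers: the measurements almost never fall exactly at the start and end of the interval.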
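To make the paragraph above concrete, here is a minimal Python sketch of that idea. It is a simplified model, not Prometheus's actual implementation: the function name `simple_increase` and the sample data are made up for illustration, and the model always scales to the full window, whereas the real extrapolation logic is more conservative.

```python
def simple_increase(samples, window_start, window_end):
    """Simplified model of increase() for a single series.

    samples: (timestamp_seconds, value) pairs, sorted by time,
    all lying inside [window_start, window_end].
    """
    if len(samples) < 2:
        return None  # at least two samples are needed to measure a change

    # Add up the sample-to-sample changes. A drop in value is treated as a
    # counter reset: the counter restarted from zero, so the new value
    # itself is the increase since the reset.
    delta = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        delta += cur - prev if cur >= prev else cur

    # Per-second rate over the time actually covered by samples,
    # then scaled up to the length of the whole window.
    covered = samples[-1][0] - samples[0][0]
    per_second = delta / covered
    return per_second * (window_end - window_start)


# Hypothetical scrapes: the counter rises by 4 in total, but the samples
# only cover 90 of the 120 seconds in the window.
samples = [(10, 0.0), (40, 1.0), (70, 3.0), (100, 4.0)]
result = simple_increase(samples, 0, 120)
print(result)  # about 5.33, not the integer 4
```

The fractional result comes purely from scaling the 90 seconds covered by samples up to the 120-second window.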

For more details, please have a look at the prometheus docs for the increase() function. There are also some good hints on what to do and what not to do when working with counters in the robust perception blog.

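After looking at your label dimensions, I also think that counter resets don't apply to the example you constructed. There is a label called reason that changed between the restarts, which created a second time series rather than continuing the existing one. So here you are essentially summing the increases of two different time series, each of which is extrapolated on its own.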
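The effect of the label change can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the scrape values and the label values "a" and "b" are hypothetical, the helper `raw_increase` is made up, and extrapolation is left out entirely so that only the series split is visible.

```python
def raw_increase(values):
    """Change across one series, counting any drop as a counter reset."""
    total = 0.0
    for prev, cur in zip(values, values[1:]):
        total += cur - prev if cur >= prev else cur
    return total


# Case 1: labels stay the same, so the restart shows up as a drop (3 -> 1)
# inside a single series and can be detected as a reset.
one_series = [0, 1, 2, 3, 1, 2]

# Case 2: the restart changes the `reason` label, so the samples after the
# restart land in a new series that is first seen at value 1.
split_series = {
    'reason="a"': [0, 1, 2, 3],
    'reason="b"': [1, 2],
}

continuous = raw_increase(one_series)
summed = sum(raw_increase(values) for values in split_series.values())
print(continuous, summed)  # 5.0 4.0
```

In the split case there is no drop inside either series, so nothing is treated as a reset, and the increment that happened before the new series was first scraped is not counted.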

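So basically there is nothing wrong with what you are doing; you just shouldn't rely on getting highly precise numbers out of prometheus for your use case.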

Answer 2

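Prometheus may return unexpected results from the increase() function for the following reasons:

Prometheus may return fractional results from increase() over integer counters because of extrapolation. See this issue for details.
Prometheus may return lower than expected results from increase(m[d]) because it doesn't take into account a possible counter increase between the last raw sample just before the specified lookbehind window [d] and the first raw sample inside the lookbehind window [d]. See this article and this comment for details.
Prometheus skips the increase for the first sample in a time series. For example, increase() would return 1 instead of 11 for the following series of samples: 10 11 11. See these docs for details.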
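The last two points in the list above can be illustrated with a small Python sketch. It is a deliberately reduced model, not Prometheus's code: `window_increase` is a hypothetical helper that only looks at samples inside the window and takes last minus first, with no extrapolation and no reset handling, and all timestamps and values are made up.

```python
def window_increase(samples, start, end):
    """Last-minus-first over the samples that fall inside [start, end]."""
    inside = [value for ts, value in samples if start <= ts <= end]
    if len(inside) < 2:
        return None
    return inside[-1] - inside[0]


# A new series whose first scraped value is already 10 (samples 10 11 11).
# The jump from "no series" to 10 is not counted.
new_series = [(5, 10), (20, 11), (35, 11)]
first_sample_case = window_increase(new_series, 0, 60)

# The counter was 5 at the last sample before the window and 8 at the first
# sample inside it. The window starts at t=60, so the 5 -> 8 jump is dropped.
older_series = [(55, 5), (70, 8), (85, 9), (100, 9)]
boundary_case = window_increase(older_series, 60, 120)

print(first_sample_case, boundary_case)  # 1 1
```

In both cases the value of the first sample inside the window acts as the baseline, so any increase that happened before that sample was scraped is invisible to the calculation.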

These issues are going to be fixed according to this design doc. In the meantime, it is possible to use other Prometheus-like systems such as VictoriaMetrics, which are free from these issues.

monitoring

metrics

prometheus