How to approximate the 90th percentile given a stream of millions of numbers

Question

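I need to calculate the 90th percentile of a stream of numbers that I receive every second. It could be up to millions of numbers per second, but the 90th percentile only has to be approximate, not exact. Is a priority queue/max heap the best way to do this, or is there something else? If so, how would I arrive at the approximate value?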

Answer 1

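The method you select will depend on the nature of your data. If you know, before you start receiving the stream, how many items you will receive, you can use a heap-based selection algorithm. For example, if you know you're going to receive 1,000,000 items and you need the 90th percentile, then you know that the 100,000th largest item marks the 90th percentile. To find it, do the following: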

create an empty min heap
add the first 100,000 items to the heap
for each remaining item
    if the item is larger than the smallest item on the heap
        remove the smallest item from the heap
        add the new item to the heap

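When you're done, the heap contains the 100,000 largest items, and the root of the heap is the smallest of those. That is your 90th percentile value.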
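The steps above could be sketched in Java roughly as follows. This is a minimal illustration, not a tuned implementation: the class name HeapPercentile and the method kthLargest are made up for this example, and it assumes the stream can be consumed as an Iterable<Double>. java.util.PriorityQueue is a min-heap by default, which is what the pseudocode needs.

```java
import java.util.PriorityQueue;

// Hypothetical helper for illustration; the names are not from the original answer.
final class HeapPercentile {
    /**
     * Returns the k-th largest value seen in the stream while holding only k values in memory.
     * For 1,000,000 items and the 90th percentile, k would be 100,000.
     */
    static double kthLargest(Iterable<Double> stream, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        // Min-heap: the root is always the smallest of the values currently kept.
        PriorityQueue<Double> heap = new PriorityQueue<>(k);
        for (double value : stream) {
            if (heap.size() < k) {
                heap.add(value);            // still filling the first k slots
            } else if (value > heap.peek()) {
                heap.poll();                // drop the smallest kept value
                heap.add(value);            // keep the larger newcomer
            }
        }
        if (heap.size() < k) {
            throw new IllegalArgumentException("stream has fewer than k items");
        }
        return heap.peek();                 // smallest of the k largest values
    }
}
```

Each item costs at most O(log k) work, and memory stays at k values no matter how long the stream is.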

A faster method that uses more memory is to save all the incoming items in a list and then run Quickselect to find the 100,000th largest item.

Both of the above will give you an exact answer.
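For the Quickselect alternative mentioned above, a sketch in Java might look like this. It assumes all values have already been collected into a double[] array; the class name QuickselectPercentile and its methods are invented for the example. It uses a Lomuto partition with a random pivot, which is only one of several ways to write Quickselect.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper for illustration; the names are not from the original answer.
final class QuickselectPercentile {
    /** Returns the k-th largest element (k = 1 is the maximum). Reorders the array in place. */
    static double kthLargest(double[] a, int k) {
        if (k < 1 || k > a.length) {
            throw new IllegalArgumentException("k out of range");
        }
        int target = a.length - k;  // index the answer would have in ascending sorted order
        int lo = 0;
        int hi = a.length - 1;
        while (lo < hi) {
            int p = partition(a, lo, hi);
            if (p == target) {
                return a[p];
            } else if (p < target) {
                lo = p + 1;         // answer lies to the right of the pivot
            } else {
                hi = p - 1;         // answer lies to the left of the pivot
            }
        }
        return a[lo];
    }

    // Lomuto partition around a randomly chosen pivot; returns the pivot's final index.
    private static int partition(double[] a, int lo, int hi) {
        int r = ThreadLocalRandom.current().nextInt(lo, hi + 1);
        swap(a, r, hi);
        double pivot = a[hi];
        int store = lo;
        for (int i = lo; i < hi; i++) {
            if (a[i] < pivot) {
                swap(a, i, store++);
            }
        }
        swap(a, store, hi);
        return store;
    }

    private static void swap(double[] a, int i, int j) {
        double tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }
}
```

The expected running time is linear in the number of items, but the whole data set has to fit in memory. If the data contains many duplicate values, a three-way partition would likely behave better than this simple version.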

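If you know that your numbers will fall within some relatively small range, you can create buckets to count them. For example, you said that your numbers are in the range 0 to 150, so you need 151 buckets. Your values aren't whole numbers, but since you said an approximation is fine, you can round each value before putting it in a bucket. So something like this should work: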

buckets = array of 151 values
for each value
    int_value = round(value)
    buckets[int_value] = buckets[int_value] + 1

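Now that you have a count for each value, computing the 90th percentile is a simple matter of summing counts from the end of the array (the highest values) until you have covered 10% of the items. Something like: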

target = 100000  // we want the top 10 percent
bucket = 150
total = 0
while (bucket >= 0)
    total += buckets[bucket]
    if (total >= target)
        break
    bucket = bucket - 1

At this point, the value of bucket is your approximate 90th percentile value.

This method will be faster than the other two and will use far less memory. But it gives an approximation rather than an exact value.
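Putting the two bucket snippets together, a Java sketch could look like the following. The class name BucketPercentile and its methods are made up for illustration. Two details are assumptions added here rather than part of the pseudocode: out-of-range values are clamped into the first or last bucket, and the target is computed from the running count instead of being hard-coded to 100,000.

```java
// Hypothetical helper for illustration; the names are not from the original answer.
final class BucketPercentile {
    private final long[] buckets;   // buckets[v] = how many values rounded to v
    private long count;             // total number of values seen so far

    /** maxValue is the largest expected value, e.g. 150 for the range 0 to 150. */
    BucketPercentile(int maxValue) {
        buckets = new long[maxValue + 1];
    }

    void add(double value) {
        int idx = (int) Math.round(value);
        // Assumption: clamp stray values instead of failing on them.
        idx = Math.max(0, Math.min(buckets.length - 1, idx));
        buckets[idx]++;
        count++;
    }

    /** Approximate p-th percentile (0 <= p < 100), accurate to the bucket width of 1. */
    int percentile(double p) {
        if (count == 0) {
            throw new IllegalStateException("no values recorded");
        }
        // Number of items that make up the top (100 - p) percent, at least 1.
        long target = Math.max(1, (long) Math.ceil(count * (100.0 - p) / 100.0));
        long total = 0;
        for (int bucket = buckets.length - 1; bucket >= 0; bucket--) {
            total += buckets[bucket];
            if (total >= target) {
                return bucket;
            }
        }
        return 0;
    }
}
```

Because the counts are just an array of 151 longs, you could reset or swap the array once per second to get a fresh percentile for each one-second window.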

Tags: java, heap, statistics, priority-queue, percentile