How does heapq.nsmallest work

Question

I am trying to work out the fastest way (in running time) to get the k (key, value) pairs with the k smallest keys in a dictionary. That is, for

mynahs = {40:(1,3),5:(5,6),11:(9,2),2:(6,3),300:(4,4),15:(2,8)}

smallestK(mynahs,3)

would return:

[(2,(6,3)),(5,(5,6)),(11,(9,2))]

I have seen a few different approaches:
1.

def smallestK(mynahs, k):
    mylist = list(mynahs.keys())
    mylist.sort()  # sort the keys in ascending order
    mylist = mylist[:k]  # keep the k smallest keys
    return [(key, mynahs[key]) for key in mylist]

but everyone seems to think heapq is the fastest:

import heapq
def smallestK(mynahs, k):
    cheap = heapq.nsmallest(k, mynahs)  # iterating a dict yields its keys
    return [(key, mynahs[key]) for key in cheap]

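How does heapq.nsmallest work, and why is it the fastest? I have looked at this question and this one, and I still don't understand. Does heapq use a min-heap to get the n smallest? How does that work? I have also heard of an algorithm called quickselect; is that what it uses?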

What is its running time? And if the dictionary is constantly changing/updating, is calling heapq.nsmallest every time you need the n smallest the fastest way to do it?

Answer 1

heapq uses a heap (_heapify_max).

Here is the implementation of heapq.nsmallest: https://github.com/python/cpython/blob/master/Lib/heapq.py#L395

See also:

http://code.activestate.com/recipes/577573-compare-algorithms-for-heapqsmallest/

Answer 2

The code for heapq.py is available at https://svn.python.org/projects/python/trunk/Lib/heapq.py

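nsmallest uses one of two algorithms. If the number of items to return is more than 10% of the total number of items in the list, it makes a copy of the list, sorts it, and returns the first k items.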

If k is smaller than n/10, it uses a heap selection algorithm:

Make a copy of the first k items, and sort it
for each remaining item in the original list
    if the item is smaller than the largest item in the new list
        replace the largest item with the new item
        re-sort the new list
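
The steps above can be sketched in Python roughly as follows. This is an illustration of the description, not the library's actual code: the function name nsmallest_sorted_buffer is made up, and bisect.insort stands in for the "re-sort" step, which is an assumption on my part about how one would keep the list ordered.

```python
import bisect
import itertools

def nsmallest_sorted_buffer(k, iterable):
    """Keep the k smallest items seen so far in a sorted list (hypothetical helper)."""
    it = iter(iterable)
    # Make a copy of the first k items and sort it.
    buffer = sorted(itertools.islice(it, k))
    if not buffer:
        return buffer
    for item in it:
        # Only an item smaller than the current largest can belong in the result.
        if item < buffer[-1]:
            buffer.pop()                 # drop the largest item
            bisect.insort(buffer, item)  # insert the new item, keeping the list sorted
    return buffer

print(nsmallest_sorted_buffer(3, [40, 5, 11, 2, 300, 15]))  # [2, 5, 11]
```

Each insertion into the sorted list can shift up to k elements, which is one reason this approach only looks attractive when k is small relative to n.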

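It's somewhat surprising that whoever wrote this used such an inefficient algorithm. At least in theory, Quickselect is an O(n) algorithm, so it should be faster than sorting, and much faster than the "optimized" algorithm for selecting n/10 items.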
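
For comparison, here is a minimal Quickselect sketch. It is not part of heapq; the function name quickselect_smallest and the random pivot choice are my own assumptions. It runs in expected linear time for the selection step, but can degrade to quadratic time with unlucky pivots, and the final sort of the k selected items adds O(k log k).

```python
import random

def quickselect_smallest(items, k):
    """Return the k smallest items in ascending order (hypothetical helper)."""
    data = list(items)  # work on a copy so the caller's data is untouched
    if k <= 0:
        return []
    if k >= len(data):
        return sorted(data)
    lo, hi = 0, len(data) - 1
    target = k - 1  # index the k-th smallest item must end up at
    while lo < hi:
        pivot = data[random.randint(lo, hi)]
        i, j = lo, hi
        # Hoare-style partition: smaller items move left, larger items move right.
        while i <= j:
            while data[i] < pivot:
                i += 1
            while data[j] > pivot:
                j -= 1
            if i <= j:
                data[i], data[j] = data[j], data[i]
                i += 1
                j -= 1
        # Continue only in the part that contains the target index.
        if target <= j:
            hi = j
        elif target >= i:
            lo = i
        else:
            break  # data[target] equals the pivot and is already in place
    return sorted(data[:k])

print(quickselect_smallest([40, 5, 11, 2, 300, 15], 3))  # [2, 5, 11]
```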

I'm not a Python guy, so I can't say for sure, but my experience with other languages suggests that the above applies to Python as well.

Update

The implementation at https://github.com/python/cpython/blob/master/Lib/heapq.py#L395 works somewhat differently.

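If k is greater than or equal to the number of items in the list, it returns a sorted list containing all the elements. Otherwise, it uses the standard heap selection algorithm: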

create a max heap from the first k items
for each remaining item
    if the item is smaller than the largest item on the heap
        remove the largest item from the heap
        add the new item to the heap
sort the resulting heap and return

The remove/add pair is combined into a single function called heap_replace (in the CPython source, _heapreplace_max).
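
The heap selection steps above can be sketched with the public heapq API. This is an illustration rather than the CPython code: the name nsmallest_heap_select is made up, and because heapq only exposes a min-heap publicly, the sketch stores negated values to simulate a max-heap, which restricts it to numeric items.

```python
import heapq
import itertools

def nsmallest_heap_select(k, iterable):
    """Heap selection with a size-k max-heap, for numeric items (hypothetical helper)."""
    it = iter(iterable)
    # Create a max-heap from the first k items (negated, since heapq is a min-heap).
    heap = [-item for item in itertools.islice(it, k)]
    if not heap:
        return []
    heapq.heapify(heap)
    for item in it:
        # -heap[0] is the largest of the k items kept so far.
        if item < -heap[0]:
            # Remove the largest and add the new item in a single step.
            heapq.heapreplace(heap, -item)
    # Sort the resulting heap and return it in ascending order.
    return sorted(-x for x in heap)

mynahs = {40: (1, 3), 5: (5, 6), 11: (9, 2), 2: (6, 3), 300: (4, 4), 15: (2, 8)}
print([(key, mynahs[key]) for key in nsmallest_heap_select(3, mynahs)])
# [(2, (6, 3)), (5, (5, 6)), (11, (9, 2))]
```

Building the heap is O(k), each replacement is O(log k), and the final sort is O(k log k), so the whole selection is O(n log k) in the worst case.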

There is an optimization that uses the standard comparer when key is None, but it uses the same basic heap selection algorithm.
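
From the caller's side, the two paths look like this. The example reuses the dictionary from the question; passing mynahs.items() so that the pairs come back directly is my own variation, not something the question's code does.

```python
import heapq
from operator import itemgetter

mynahs = {40: (1, 3), 5: (5, 6), 11: (9, 2), 2: (6, 3), 300: (4, 4), 15: (2, 8)}

# key=None: the (key, value) tuples are compared directly. Dictionary keys are
# unique, so the comparison is always decided by the first element.
pairs_default = heapq.nsmallest(3, mynahs.items())

# Explicit key function: only the dictionary key takes part in the comparison.
pairs_keyed = heapq.nsmallest(3, mynahs.items(), key=itemgetter(0))

print(pairs_default)  # [(2, (6, 3)), (5, (5, 6)), (11, (9, 2))]
print(pairs_keyed)    # [(2, (6, 3)), (5, (5, 6)), (11, (9, 2))]
```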

This implementation is more efficient than the other one I described, although I expect it would be slower than Quickselect in the general case.
