推荐 C++ 容器保存前 20 个最小值

Question

在SQL中有一个功能可以说

SELECT TOP 20 distance FROM dbFile ORDER BY distance ASC

如果我的 SQL 是正确的，比如说 10,000 条记录，这应该 return 我的数据库中的 20 个最小距离。

我没有数据库。我有一个包含 100,000 个元素的简单数组。

是否有 C++ 容器，Boost, MFC or STL 可以为像

这样的结构提供简单的代码

struct closest{
    int    ID;
    double distance;
    closest():ID(-1), distance(std::numeric_limits<double>::max( )){}
};

我可以在哪里构建按距离排序的容器，例如

boost::container::XXXX<closest> top(20);

然后有一个简单的

top.replace_if(closest(ID,Distance));

容器将用我的新条目替换容器中当前最大距离的条目，如果它小于我容器中当前的最大距离。

我不担心速度。我喜欢优雅干净的解决方案，其中容器和代码完成所有 繁重的工作。

编辑。收到所有很棒的答案后的附录。

我真的很想找到它，因为它很优雅。是我可以创建的具有容器大小限制的排序容器。在我的例子中是 20。然后我可以推送或插入 100 000 项或更多项到我的心中。但。总有一个但是。如果容器的比较器值不在最低的 20 个值内，则容器将通过替换或不插入项目来保持最大大小 20。

是的。我现在从所有这些答案中知道，通过编程和调整现有容器可以达到相同的效果。也许在下一轮针对 C 和 C++ 标准委员会的建议召开时。我们可以建议。 自分类（我们已经有了）和自限制容器大小。

Answer 1

我的第一个想法是为此使用带有自定义比较器的 std::map 或 std::set（编辑：或者更好，如评论中所述的 std::priority_queue）。

您的比较器为您排序。

您基本上将所有元素添加到其中。添加一个元素后，检查里面的元素是否超过n个。如果有，去掉最后一个。

Answer 2

我不是 100% 确定没有更优雅的解决方案，但即使是 std::set 也很漂亮。

您所要做的就是为您的元素定义一个合适的比较器（例如 > 运算符），然后执行以下操作：

std::set<closest> tops(arr, arr+20)
tops.insert(another);
tops.erase(tops.begin());

Answer 3

您需要的是大小为 20 的 maxheap。回想一下，您的堆的根将是堆中的最大值。

该堆将包含您目前遇到的距离最小的记录。对于 10000 个值中的前 20 个，您只需将其推送到堆中。

此时您遍历其余记录，并将每条记录与堆的根进行比较。

记住你堆的根基本上是最好中最差的。（距离最大的记录，在你目前遇到的距离最短的20条记录中）。

如果您考虑的值不值得保留（它的距离大于树的根），请忽略该记录并继续移动。

否则你弹出你的堆（摆脱根）并推入新值。优先级队列将自动将其距离最大的记录再次放在根上。

在整个 10000 个值的集合上继续执行此操作后，您将剩下距离最小的 20 个记录，这就是您想要的。

每个 push-pop 需要恒定的 O(1) 时间，遍历 N 的所有输入是 O(n)，所以这是一个线性解决方案。

编辑：我认为用 C++ 代码展示我的想法会很有用。这是一个玩具示例，您可以使用模板编写通用版本，但我选择保持简单和简约：

#include <iostream>
#include <queue>
using namespace std;
class smallestElements
{
private:
    priority_queue<int,std::vector<int>,std::less<int> > pq;
    int maxSize;
public:
    smallestElements(int size): maxSize(size)
    {
    pq=priority_queue<int, std::vector<int>, std::less<int> >();
    }
    void possiblyAdd(int newValue)
    {
        if(pq.size()<maxSize)
        {
            pq.push(newValue);
            return;
        }
        if(newValue < pq.top())
        {
            pq.pop(); //get rid of the root
            pq.push(newValue); //priority queue will automatically restructure
        }
    }
    void printAllValues()
    {
        priority_queue<int,std::vector<int>,std::less<int> > cp=pq;
        while(cp.size()!=0)
        {
            cout<<cp.top()<<" ";
            cp.pop();
        }
        cout<<endl;
    }
};

你如何使用它真的很简单。基本上在您的主要功能中，您将拥有：

smallestElements se(20); //we want 20 smallest
//...get your stream of values from wherever you want, call the int x
se.possiblyAdd(x); //no need for bounds checking or anything fancy
//...keep looping or potentially adding until the end

se.printAllValues();//shows all the values in your container of smallest values
// alternatively you can write a function to return all values if you want

Answer 4

如果这是关于动态地从流中过滤 20 个最小的元素，那么基于 std::priority_queue（或 std::multiset）的解决方案就是可行的方法。

但是，如果它是关于在给定数组中找到 20 个最小的元素，我根本不会选择特殊的容器，而只是算法 std::nth_element - 一种部分排序算法，它会给你n 个最小的元素 - 编辑：或 std::partial_sort（感谢 Jarod42）如果元素也必须排序。它具有线性复杂度，只需一行即可编写（+ 比较运算符，在任何情况下都需要）：

#include <vector>
#include <iostream>
#include <algorithm>

struct Entry {
    int ID;
    double distance;    
};

std::vector<Entry> data;    

int main() {
    //fill data;

    std::nth_element(data.begin(), data.begin() + 19, data.end(), 
        [](auto& l, auto& r) {return l.distance < r.distance; });

    std::cout << "20 elements with smallest distance: \n"; 
    for (size_t i = 0; i < 20; ++i) {
        std::cout << data[i].ID << ":" << data[i].distance << "\n";
    }
    std::cout.flush();
}

如果您不想更改原始数组的顺序，则必须先复制整个数组。

Answer 5

我会像@juanchopanza 在他删除之前建议的那样使用 nth_element。

他的代码如下：

bool comp(const closest& lhs, const closest& rhs)
{
    return lhs.distance < rhs.distance;
}

然后

std::vector<closest> v = ....;
nth_element(v.begin(), v.begin() + 20, v.end(), comp);

虽然如果它只有 20 个元素，那么我会使用 std::array。

Answer 6

只是为了让你们都能看到我目前正在做的似乎有效的事情。

struct closest{
    int base_ID;
    int ID;
    double distance;

    closest(int BaseID, int Point_ID, 
    double Point_distance):base_ID(BaseID), 
    ID(Point_ID),distance(Point_distance){}
    closest():base_ID(-1), ID(-1),
    distance(std::numeric_limits<double>::max( )){}

    bool operator<(const closest& rhs) const
    {
        return distance < rhs.distance;
    }
};



void calc_nearest(void) 
{ 
    boost::heap::priority_queue<closest> svec;

    for (int current_gift = 0; current_gift < g_nVerticesPopulated; ++current_gift)
    {   double best_distance=std::numeric_limits<double>::max();    
        double our_distance=0.0;
        svec.clear();

        for (int all_other_gifts = 0; all_other_gifts < g_nVerticesPopulated;++all_other_gifts)
        {   
            our_distance = distanceVincenty(g_pVertices[current_gift].lat,g_pVertices[current_gift].lon,g_pVertices[all_other_gifts].lat,g_pVertices[all_other_gifts].lon);
            if (our_distance != 0.0)
            {
                if (our_distance < best_distance) // don't bother to push and sort if the calculated distance is greater than current 20th value
                    svec.push(closest(g_pVertices[current_gift].ID,g_pVertices[current_gift].ID,our_distance));

                if (all_other_gifts%100 == 0)
                {
                    while (svec.size() > no_of_closest_points_to_calculate) svec.pop(); // throw away any points above no_of_closest_points_to_calculate
                    closest t =  svec.top(); // the furthest of the no_of_closest_points_to_calculate points for optimisation
                    best_distance = t.distance;
                }
            }
        }
        std::cout << current_gift << "\n";
    }
}

如你所见。我在 openGl 球体上绘制了 100 000 个纬度和经度点。我正在计算每个点与其他点的对比，目前只保留最接近的 20 个点。如果值大于第 20 个最近点，则不推送值，从而进行一些原始优化。

因为我习惯了 Prolog 需要几个小时才能解决问题，所以我不担心速度。我会运行这个晚上。

感谢大家的帮助。非常感谢。仍然需要审核代码和结果，但很高兴我正朝着正确的方向前进。

Answer 7

我最近在此处post针对检索 前 5 个最小值 的类似问题提供了多种方法：

有些实现以不同的方式从输入向量中保留特定数量的最小或最大项。 nth_element 算法执行部分排序，优先级队列维护一个堆，集合维护一个二叉搜索树，基于双端队列和向量的方法只是根据（线性）min/max 删除一个元素搜索。

implement a custom comparison operator 并调整项目数量以保持 n.

应该相当容易

这是代码（根据另一个 post 重构）：

#include <algorithm>
#include <functional>
#include <queue>
#include <set>
#include <vector>
#include <random>
#include <iostream>
#include <chrono>

template <typename T, typename Compare = std::less<T>>
std::vector<T> filter_nth_element(std::vector<T> v, typename std::vector<T>::size_type n) {
    auto target = v.begin()+n;
    std::nth_element(v.begin(), target, v.end(), Compare());
    std::vector<T> result(v.begin(), target);
    return result;
}

template <typename T, typename Compare = std::less<T>>
std::vector<T> filter_pqueue(std::vector<T> v, typename std::vector<T>::size_type n) {
    std::vector<T> result;
    std::priority_queue<T, std::vector<T>, Compare> q;
    for (auto i: v) {
        q.push(i);
        if (q.size() > n) {
            q.pop();
        }
    }
    while (!q.empty()) {
        result.push_back(q.top());
        q.pop();
    }
    return result;
}

template <typename T, typename Compare = std::less<T>>
std::vector<T> filter_set(std::vector<T> v, typename std::vector<T>::size_type n) {
    std::set<T, Compare> s;
    for (auto i: v) {
        s.insert(i);
        if (s.size() > n) {
            s.erase(std::prev(s.end()));
        }
    }
    return std::vector<T>(s.begin(), s.end());
}

template <typename T, typename Compare = std::less<T>>
std::vector<T> filter_deque(std::vector<T> v, typename std::vector<T>::size_type n) {
    std::deque<T> q;
    for (auto i: v) {
        q.push_back(i);
        if (q.size() > n) {
            q.erase(std::max_element(q.begin(), q.end(), Compare()));
        }
    }
    return std::vector<T>(q.begin(), q.end());
}

template <typename T, typename Compare = std::less<T>>
std::vector<T> filter_vector(std::vector<T> v, typename std::vector<T>::size_type n) {
    std::vector<T> q;
    for (auto i: v) {
        q.push_back(i);
        if (q.size() > n) {
            q.erase(std::max_element(q.begin(), q.end(), Compare()));
        }
    }
    return q;
}

template <typename Clock = std::chrono::high_resolution_clock>
struct stopclock {
    std::chrono::time_point<Clock> start;
    stopclock() : start(Clock::now()) {}
    ~stopclock() {
        auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(Clock::now() - start);
        std::cout << "elapsed: " << elapsed.count() << "ms\n";
    }
};

std::vector<int> random_data(std::vector<int>::size_type n) {
    std::mt19937 gen{std::random_device()()};
    std::uniform_int_distribution<> dist;
    std::vector<int> out(n);
    for (auto &i: out)
        i = dist(gen);
    return out;
}

int main() {
    std::vector<int> data = random_data(1000000);
    stopclock<> sc;
    std::vector<int> result = filter_nth_element(data, 5);
    std::cout << "smallest values: ";
    for (auto i : result) {
        std::cout << i << " ";
    }
    std::cout << "\n";
    std::cout << "largest values: ";
    result = filter_nth_element<int, std::greater<int>>(data, 5);
    for (auto i : result) {
        std::cout << i << " ";
    }
    std::cout << "\n";
}

示例输出为：

$ g++ test.cc -std=c++11 && ./a.out
smallest values: 4433 2793 2444 4542 5557 
largest values: 2147474453 2147475243 2147477379 2147469788 2147468894 
elapsed: 123ms

请注意，在这种情况下，只有第 n 个元素的位置相对于提供的比较运算符所施加的顺序是准确的。然而，其他元素保证为 smaller/greater 或等于那个元素，具体取决于提供的比较运算符。也就是说，返回了前 n min/max 个元素，但它们没有正确排序。

不要指望其他算法也能按特定顺序产生结果。（虽然使用优先级队列和集合的方法实际上产生排序输出，但它们的结果顺序相反）。

供参考：

Answer 8

堆是你需要的数据结构。 C++11 之前的 stl 只有在您自己的数组中管理堆数据的函数。有人提到 boost 有一个堆 class，但如果您的数据是简单的整数，则无需使用 boost。 stl 的堆就可以了。当然，该算法是对堆进行排序，使最高值排在第一个。因此，对于每个新值，您都将其推送到堆上，一旦堆的大小达到 21 个元素，就从堆中弹出第一个值。这样剩下的20个值总是最低的20个。

Answer 9

I actually have 100 000 Lat & Lon points drawn on a opengl sphere. I want to work out the 20 nearest points to each of the 100 000 points. So we have two loops to pick each point then calculate that point against every other point and save the closest 20 points.

这看起来好像您想首先执行 k 最近邻搜索。为此，您通常使用专门的数据结构（例如，二叉搜索树）来加速查询（尤其是当您执行 100k 个查询时）。

对于球坐标，您必须转换为笛卡尔坐标 space 才能修复坐标环绕。然后你会使用八叉树或 kD 树。

这是一种使用近似最近邻 (FLANN) 快速库的方法：

#include <vector>
#include <random>
#include <iostream>
#include <flann/flann.hpp>
#include <cmath>

struct Point3d {
    float x, y, z;
    void setLatLon(float lat_deg, float lon_deg) {
        static const float r = 6371.; // sphere radius
        float lat(lat_deg*M_PI/180.), lon(lon_deg*M_PI/180.);
        x = r * std::cos(lat) * std::cos(lon);
        y = r * std::cos(lat) * std::sin(lon);
        z = r * std::sin(lat);
    }
};

std::vector<Point3d> random_data(std::vector<Point3d>::size_type n) {
    static std::mt19937 gen{std::random_device()()};
    std::uniform_int_distribution<> dist(0, 36000);
    std::vector<Point3d> out(n);
    for (auto &i: out)
        i.setLatLon(dist(gen)/100., dist(gen)/100.);
    return out;
}

int main() {
    // generate random spherical point cloud
    std::vector<Point3d> data = random_data(1000);
    // generate query point(s) on sphere
    std::vector<Point3d> query = random_data(1);

    // convert into library datastructures
    auto mat_data = flann::Matrix<float>(&data[0].x, data.size(), 3);
    auto mat_query = flann::Matrix<float>(&query[0].x, query.size(), 3);
    // build KD-Tree-based index data structure
    flann::Index<flann::L2<float> > index(mat_data, flann::KDTreeIndexParams(4));
    index.buildIndex();
    // perform query: approximate nearest neighbor search
    int k = 5; // number of neighbors to find
    std::vector<std::vector<int>> k_indices;
    std::vector<std::vector<float>> k_dists;
    index.knnSearch(mat_query, k_indices, k_dists, k, flann::SearchParams(128));

    // k_indices now contains for each query point the indices to the neighbors in the original point cloud
    // k_dists now contains for each query point the distances to those neighbors
    // removed printing of results for brevity
}

您会收到与此类似的结果（点击放大）：

供参考：

推荐 C++ 容器保存前 20 个最小值

Recommend C++ container to hold top 20 minimum values

c++

containers

filtering