Could passing too many arguments by reference be inefficient?

Question

Disclaimer: I'm using Intel Compiler 2017. If you want to know why I'm doing this, skip to the end of the question.

I have this code:

class A{
  std::vector<float> v;
  ...
  void foo();
  void bar();
};

void A::foo(){
  for(int i=0; i<bigNumber;i++){
    //something very expensive
    //call bar() many times per cycle;
  }
}

void A::bar(){
  //...
  v.push_back(/*something*/);
}

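Now, suppose I want to parallelize foo(), since it is very expensive. However, I can't simply use #pragma omp parallel for because of v.push_back().

As far as I know, there are two options here:

1. We use #pragma omp critical.
2. We create a local version of v for each thread and then join them at the end of the parallel section, more or less as described here.

Option 1 is generally considered a poor solution, because threads competing for the critical section create considerable overhead.

Option 2, however, requires modifying bar() like this: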

class A{
  std::vector<float> v;
  ...
  void foo();
  void bar(std::vector<float> &local_v);
};

void A::foo(){
  #pragma omp parallel
  {
    std::vector<float> local_v;
    #pragma omp for
    for(int i=0; i<bigNumber;i++){
      //something very expensive
      //call bar(local_v) many times per cycle;
    }
    #pragma omp critical
    {
      v.insert(v.end(), local_v.begin(), local_v.end());
    }
  }
}

void A::bar(std::vector<float> &local_v){
  //...
  //append to the thread-local buffer, not to the shared member v
  local_v.push_back(/*something*/);
}
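For reference, here is a minimal self-contained sketch of option 2. It assumes that bar() simply stores a value derived from the loop index, as a hypothetical stand-in for the real work. The extra int parameter, the return value of foo() and the values() accessor are illustrative additions that are not part of the code above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the class in the question.
class A {
public:
    // Runs the loop in parallel and returns how many values v holds afterwards.
    std::size_t foo(int bigNumber);
    const std::vector<float>& values() const { return v; }

private:
    // Appends to the caller-provided buffer, never to the shared member v.
    void bar(std::vector<float>& local_v, int i);

    std::vector<float> v;
};

void A::bar(std::vector<float>& local_v, int i) {
    local_v.push_back(static_cast<float>(i));  // placeholder for the real result
}

std::size_t A::foo(int bigNumber) {
    assert(bigNumber >= 0);

    #pragma omp parallel
    {
        std::vector<float> local_v;  // private to each thread

        #pragma omp for nowait
        for (int i = 0; i < bigNumber; i++) {
            bar(local_v, i);
        }

        // One short critical section per thread instead of one per push_back.
        #pragma omp critical
        v.insert(v.end(), local_v.begin(), local_v.end());
    }
    return v.size();
}
```

Built with OpenMP enabled (for example -qopenmp with the Intel compiler or -fopenmp with GCC/Clang), each thread fills its own buffer and takes the lock only once. Without the flag the pragmas are ignored and the loop runs serially with the same contents in v. Note that the order of the elements in v depends on how the threads are scheduled.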

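So far so good. Now suppose there is not just v but 10 vectors, say v1, v2, ..., v10, or in any case 10 shared variables. On top of that, suppose bar is not called directly inside foo() but only after many nested calls: foo() calls foo1(std::vector<float> &v1, ..., std::vector<float> &v10), which calls foo2(std::vector<float> &v1, ..., std::vector<float> &v10), and this nesting repeats many times until the last function finally calls bar(std::vector<float> &v1, ..., std::vector<float> &v10).

So this looks like a maintainability nightmare (I have to modify all the headers and the calls of all the nested functions)... but more importantly: we agree that passing by reference is efficient, but it is still always a pointer copy. As you can see, a lot of pointers get copied a lot of times here. Is it possible that all these copies add up to an inefficiency?

Actually, what I care about most here is performance, so if you tell me "nah, it's fine because compilers are super intelligent and they do some sorcery so you can copy one trillion of references and there is no drop in performance", that would be fine, but I don't know whether such sorcery exists.

Why I'm doing this: I'm trying to parallelize this code. In particular, I'm rewriting the while here as a for which can be parallelized, but if you follow the code you'll find out that the callback onAffineShapeFound from here is called, and it modifies the state of the shared object keys. This happens for many other variables, but this is the "deepest" case in this code.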

Answer 1

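In a direct comparison between A::bar() and A::bar(std::vector<float> &v), the difference is that the second version has to grow the stack by 8 bytes more than the original does (the size of one pointer on a 64-bit target). In terms of performance this is a very minor effect: the stack pointer has to be adjusted whether or not the function takes arguments, so the only real difference is a single pointer copy, which, depending on the compiler, might even be optimized away. As for the actual performance of the function itself, constantly adding elements to a std::vector is a far more expensive operation, especially if the vector needs to reallocate (which will probably happen often, depending on how large the vector needs to get). Those costs will far outweigh the cost of the pointer copies.

So, the short version: go nuts with references.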
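If the long parameter lists are the bigger worry, one possible way to keep them short (not something proposed above, just a sketch) is to group the per-thread vectors in a single struct and pass one reference through the nested calls. LocalBuffers, merge_into and run are hypothetical names, only two of the ten vectors are shown, and the values pushed are placeholders.

```cpp
#include <cassert>
#include <vector>

// Hypothetical bundle of the per-thread buffers (two shown instead of ten).
struct LocalBuffers {
    std::vector<float> v1;
    std::vector<float> v2;
};

// Deepest call: writes into the bundle it was handed.
inline void bar(LocalBuffers& buf, int i) {
    buf.v1.push_back(static_cast<float>(i));
    buf.v2.push_back(static_cast<float>(2 * i));
}

// Intermediate layers forward a single reference, whatever the bundle holds.
inline void foo2(LocalBuffers& buf, int i) { bar(buf, i); }
inline void foo1(LocalBuffers& buf, int i) { foo2(buf, i); }

// Appends one thread's results to the shared bundle.
// Must be called while holding the lock (here: inside the critical section).
inline void merge_into(LocalBuffers& shared, const LocalBuffers& local) {
    shared.v1.insert(shared.v1.end(), local.v1.begin(), local.v1.end());
    shared.v2.insert(shared.v2.end(), local.v2.begin(), local.v2.end());
}

LocalBuffers run(int bigNumber) {
    assert(bigNumber >= 0);
    LocalBuffers shared;

    #pragma omp parallel
    {
        LocalBuffers local;  // private to each thread

        #pragma omp for nowait
        for (int i = 0; i < bigNumber; i++) {
            foo1(local, i);
        }

        #pragma omp critical
        merge_into(shared, local);
    }
    return shared;
}
```

With this layout, adding an eleventh vector means touching the struct, bar() and merge_into(), while the signatures of foo1() and foo2() stay unchanged.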

c++

parallel-processing

intel

openmp