Comparing 2 different scenarios on 2 different architectures when finding the max element of an array

Question

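Let's say we have a classic scenario where we need to find the max element of an array (integers only), not its position. Which of the following 2 code samples (placed inside a 'for' loop) runs faster on a CPU, which runs faster on a GPU, and why?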

if( array[i] > max)
  max = array[i];

and

max = 0.5 * ( a + b + abs(a-b));      //Where 'a' and 'b' refer to 'max' and 'array[i]'

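Also, what really bothers me in the second piece of code is the 'abs' function call. Is there a way to compute the absolute value of a number using only arithmetic expressions?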

Answer 1

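Load-balance the work across the CPU and GPU according to how long each one takes to finish its share.

Let tx be the CPU time, ty the GPU time, wx the CPU's share of the work, and wy the GPU's share of the work (as fractions of the total).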

Iteration 1
wx = 0.5;  wy = 0.5;   // just a guess: 1 / (total number of devices)

Iteration 2
px = wx / tx;   // compute power of the CPU
py = wy / ty;   // compute power of the GPU
// doing the same share of work in less time means more compute power

ptotal = px + py;

wx = px / ptotal;
wy = py / ptotal;

Setting each work share to exactly the measured power can make the shares oscillate from one iteration to the next, so you may need a relaxation constant that moves each share only part of the way toward its target:

wx = wx + 0.3 * (px / ptotal - wx);
wy = wy + 0.3 * (py / ptotal - wy);

so small changes in instantaneous compute power won't throw the balance off.

Iteration 3:

px = wx / tx;
py = wy / ty;

ptotal = px + py;

wx = wx + 0.3 * (px / ptotal - wx);
wy = wy + 0.3 * (py / ptotal - wy);

But in OpenCL you give each device a suitable local work size, so the work shares have to be rounded to a multiple of that local work size.

global_range_x= nearest_multiple_of_256(wx * total_global_range);
global_range_y= nearest_multiple_of_256(wy * total_global_range);

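If the global ranges add up to the total range, each device's offset can be computed from the ranges of the other devices.

If the CPU's compute range is 768 and the GPU's is 256, you can set their global offsets to 0 (CPU) and 768 (GPU) so that they don't overlap.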
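The scheme above can be sketched in plain C as follows. This is only an illustration under assumptions: the names `shares_t`, `rebalance`, `split_range` and `LOCAL_SIZE` are hypothetical, the local work size is taken to be 256 as in the example, the total range is assumed to be a multiple of it, and the relaxation step is read as "move each share part of the way toward its measured target" so the two shares keep summing to 1.

```c
#include <stddef.h>

#define LOCAL_SIZE 256   /* assumed local work size, as in the example above */

/* Hypothetical container for the two work shares (wx + wy == 1). */
typedef struct {
    double wx;   /* CPU share */
    double wy;   /* GPU share */
} shares_t;

/* One rebalancing step: tx and ty are the measured times of the last run,
   relax is the relaxation constant (1.0 jumps straight to the target). */
static shares_t rebalance(shares_t s, double tx, double ty, double relax)
{
    double px = s.wx / tx;            /* compute power of the CPU */
    double py = s.wy / ty;            /* compute power of the GPU */
    double ptotal = px + py;

    s.wx += relax * (px / ptotal - s.wx);
    s.wy = 1.0 - s.wx;                /* keep the shares summing to 1 */
    return s;
}

/* Split `total` work items into two ranges that are multiples of LOCAL_SIZE.
   The CPU starts at offset 0; the GPU starts where the CPU range ends. */
static void split_range(size_t total, double wx,
                        size_t *range_x, size_t *range_y, size_t *offset_y)
{
    size_t groups = total / LOCAL_SIZE;
    size_t gx = (size_t)(wx * (double)groups + 0.5);   /* round to nearest group */
    if (gx > groups)
        gx = groups;

    *range_x = gx * LOCAL_SIZE;
    *range_y = total - *range_x;
    *offset_y = *range_x;
}
```

Giving the GPU whatever is left after rounding the CPU range guarantees that the two ranges add up to the total, which rounding both shares independently would not.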

Answer 2

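I think what you're really asking about is the branchless vs. branching way to do m = max(m, array[i]). Depending on optimization settings, C compilers will already compile the if() version to branchless code (using cmov). It can even auto-vectorize to a packed compare or a packed-max instruction.

Your 0.5 * abs() version is obviously terrible (a lot slower than a conditional move) because it converts to double and back, rather than right-shifting to divide by two.

See the asm on the Godbolt Compiler Explorer: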

// auto-vectorizes to PMAXSD, or without SSE4.1, to pcmpgt / pand/por emulation of the same
int maxarray_if(int arr[], int n) {
  int result = arr[0];
  for (int i=0 ; i<n; ++i) {
    int tmp = arr[i];
    if (result < tmp)
      result = tmp;
  }
  return result;
}

gcc 5.3 -O3 -march=haswell -mno-avx auto-vectorizes the inner loop:

.L13:
    add     eax, 1
    pmaxsd  xmm0, XMMWORD PTR [rdx]
    add     rdx, 16
    cmp     r8d, eax
    ja      .L13

vs. the FP version:

    ... whole bunch of integer crap
    cvtsi2sd        xmm0, eax
    mulsd   xmm0, xmm1
    cvttsd2si       eax, xmm0

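So the FP version is clearly garbage.

You'll get similar results for any target architecture. The conversion to/from double isn't going to go away; gcc keeps it even with -ffast-math.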
