为什么相同的 OpenCL 代码在 Intel Xeon CPU 和 NVIDIA GTX 1080 Ti GPU 上有不同的输出？

Question

我正在尝试使用 OpenCL 并行化 Monte Carlo 模拟。我使用 MWC64X 作为统一随机数生成器。代码运行在不同的英特尔 CPU 上运行良好，因为并行计算的输出非常接近顺序计算。

Using OpenCL device: Intel(R) Xeon(R) CPU E5-2630L v3 @ 1.80GHz
Literal influence running time: 0.029048 seconds        r1 seqInfl= 0.4771
Literal influence running time: 0.029762 seconds        r2 seqInfl= 0.4771
Literal influence running time: 0.029742 seconds        r3 seqInfl= 0.4771
Literal influence running time: 0.02971 seconds         ra seqInfl= 0.4771
Literal influence running time: 0.029225 seconds        trust1-57 seqInfl= 0.6001
Literal influence running time: 0.04992 seconds         trust110-1 seqInfl= 0
Literal influence running time: 0.034636 seconds        trust4-57 seqInfl= 0
Literal influence running time: 0.049079 seconds        trust57-110 seqInfl= 0
Literal influence running time: 0.024442 seconds        trust57-4 seqInfl= 0.8026
Literal influence running time: 0.04946 seconds         trust33-1 seqInfl= 0
Literal influence running time: 0.049071 seconds        trust57-33 seqInfl= 0
Literal influence running time: 0.053117 seconds        trust4-1 seqInfl= 0.1208
Literal influence running time: 0.051642 seconds        trust57-1 seqInfl= 0
Literal influence running time: 0.052052 seconds        trust57-64 seqInfl= 0
Literal influence running time: 0.052118 seconds        trust64-1 seqInfl= 0
Literal influence running time: 0.051998 seconds        trust57-7 seqInfl= 0
Literal influence running time: 0.052069 seconds        trust7-1 seqInfl= 0
Total number of literals: 17
Sequential influence running time: 0.71728 seconds
Sequential maxInfluence Literal: trust57-4 0.8026

index1= 17 size= 51 dim1_size= 6
sum0:4781   influence0:0.478100 sum2:4781   influence2:0.478100 sum6:0  influence6:0.000000 sum10:0 sum12:0 influence12:0.000000    sum7:0  influence7:0.000000 influence10:0.000000    sum4:5962   influence4:0.596200 sum8:7971   influence8:0.797100 sum1:4781   influence1:0.478100 sum3:4781   influence3:0.478100 sum13:0 influence13:0.000000    sum11:1261  influence11:0.126100    sum9:0  influence9:0.000000 sum14:0 influence14:0.000000    sum5:0  influence5:0.000000 sum15:0 influence15:0.000000    sum16:0 influence16:0.000000    
Parallel influence running time: 0.054391 seconds
Parallel maxInfluence Literal: trust57-4 Infl=0.7971

然而，当我运行 GeForce GTX 1080 Ti 上的代码，安装了 NVIDIA-SMI 430.40 和 CUDA 10.1 以及 OpenCL 1.2 CUDA 时，输出如下：

Using OpenCL device: GeForce GTX 1080 Ti
Influence:
Literal influence running time: 0.011119 seconds        r1 seqInfl= 0.4771
Literal influence running time: 0.011238 seconds        r2 seqInfl= 0.4771
Literal influence running time: 0.011408 seconds        r3 seqInfl= 0.4771
Literal influence running time: 0.01109 seconds         ra seqInfl= 0.4771
Literal influence running time: 0.011132 seconds        trust1-57 seqInfl= 0.6001
Literal influence running time: 0.018978 seconds        trust110-1 seqInfl= 0
Literal influence running time: 0.013093 seconds        trust4-57 seqInfl= 0
Literal influence running time: 0.018968 seconds        trust57-110 seqInfl= 0
Literal influence running time: 0.009105 seconds        trust57-4 seqInfl= 0.8026
Literal influence running time: 0.018753 seconds        trust33-1 seqInfl= 0
Literal influence running time: 0.018583 seconds        trust57-33 seqInfl= 0
Literal influence running time: 0.02005 seconds         trust4-1 seqInfl= 0.1208
Literal influence running time: 0.01957 seconds         trust57-1 seqInfl= 0
Literal influence running time: 0.019686 seconds        trust57-64 seqInfl= 0
Literal influence running time: 0.019632 seconds        trust64-1 seqInfl= 0
Literal influence running time: 0.019687 seconds        trust57-7 seqInfl= 0
Literal influence running time: 0.019859 seconds        trust7-1 seqInfl= 0
Total number of literals: 17
Sequential influence running time: 0.272032 seconds
Sequential maxInfluence Literal: trust57-4 0.8026

index1= 17 size= 51 dim1_size= 6
sum0:10000  sum1:10000  sum2:10000  sum3:10000  sum4:10000  sum5:0  sum6:0  sum7:0  sum8:10000  sum9:0  sum10:0 sum11:0 sum12:0 sum13:0 sum14:0 sum15:0 sum16:0 
Parallel influence running time: 0.193581 seconds

"Influence" 值等于 sum*1.0/10000，因此并行影响仅由 1 和 0 组成，这是不正确的（在 GPU 运行s 中）并且在并行化时不会发生在英特尔 CPU.

当我检查随机数生成器的输出时 if(flag==0) printf("randint=%u",randint);，GPU 上的输出似乎全为零。下面是 clinfo 和 .cl 代码：

 Device Name                                     GeForce GTX 1080 Ti
  Device Vendor                                   NVIDIA Corporation
  Device Vendor ID                                0x10de
  Device Version                                  OpenCL 1.2 CUDA
  Driver Version                                  430.40
  Device OpenCL C Version                         OpenCL C 1.2 
  Device Type                                     GPU
  Device Topology (NV)                            PCI-E, 68:00.0
  Device Profile                                  FULL_PROFILE
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Max compute units                               28
  Max clock frequency                             1721MHz
  Compute Capability (NV)                         6.1
  Device Partition                                (core)
    Max number of sub-devices                     1
    Supported partition types                     None
  Max work item dimensions                        3
  Max work item sizes                             1024x1024x64
  Max work group size                             1024
  Preferred work group size multiple              32
  Warp size (NV)                                  32
  Preferred / native vector sizes                 
    char                                                 1 / 1       
    short                                                1 / 1       
    int                                                  1 / 1       
    long                                                 1 / 1       
    half                                                 0 / 0        (n/a)
    float                                                1 / 1       
    double                                               1 / 1        (cl_khr_fp64)
  Half-precision Floating-point support           (n/a)
  Single-precision Floating-point support         (core)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  Yes
  Double-precision Floating-point support         (cl_khr_fp64)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
  Address bits                                    64, Little-Endian
  Global memory size                              11720130560 (10.92GiB)
  Error Correction support                        No
  Max memory allocation                           2930032640 (2.729GiB)
  Unified memory for Host and Device              No
  Integrated memory (NV)                          No
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       4096 bits (512 bytes)
  Global Memory cache type                        Read/Write
  Global Memory cache size                        458752 (448KiB)
  Global Memory cache line size                   128 bytes
  Image support                                   Yes
    Max number of samplers per kernel             32
    Max size for 1D images from buffer            134217728 pixels
    Max 1D or 2D image array size                 2048 images
    Max 2D image size                             16384x32768 pixels
    Max 3D image size                             16384x16384x16384 pixels
    Max number of read image args                 256
    Max number of write image args                16
  Local memory type                               Local
  Local memory size                               49152 (48KiB)
  Registers per block (NV)                        65536
  Max number of constant args                     9
  Max constant buffer size                        65536 (64KiB)
  Max size of kernel argument                     4352 (4.25KiB)
  Queue properties                                
    Out-of-order execution                        Yes
    Profiling                                     Yes
  Prefer user sync for interop                    No
  Profiling timer resolution                      1000ns
  Execution capabilities                          
    Run OpenCL kernels                            Yes
    Run native kernels                            No
    Kernel execution timeout (NV)                 Yes
  Concurrent copy and kernel execution (NV)       Yes
    Number of async copy engines                  2
  printf() buffer size                            1048576 (1024KiB)

#define N 70 // N > index, which is the total number of literals
#define BASE 4294967296UL

//! Represents the state of a particular generator
typedef struct{ uint x; uint c; } mwc64x_state_t;
enum{ MWC64X_A = 4294883355U };
enum{ MWC64X_M = 18446383549859758079UL };

void MWC64X_Step(mwc64x_state_t *s)
{
    uint X=s->x, C=s->c;

    uint Xn=MWC64X_A*X+C;
    uint carry=(uint)(Xn<C);                // The (Xn<C) will be zero or one for scalar
    uint Cn=mad_hi(MWC64X_A,X,carry);  

    s->x=Xn;
    s->c=Cn;
}

//! Return a 32-bit integer in the range [0..2^32)
uint MWC64X_NextUint(mwc64x_state_t *s)
{
    uint res=s->x ^ s->c;
    MWC64X_Step(s);
    return res;
}


__kernel void setInfluence(const int literals, const int size, const int dim1_size, __global int* lambdas, __global float* lambdap, __global int* dim2_size, __global float* influence){   
    int flag=get_global_id(0);
    int sum=0;
    int count=10000;
    int assignment[N];
    //or try to get newlambda like original version does
    if(flag < literals){
        mwc64x_state_t rng;
        for(int i=0; i<count; i++){
            for(int j=0; j<size; j++){
                uint randint=MWC64X_NextUint(&rng);
                float rand=randint*1.0/BASE;
                //if(flag==0)
                //  printf("randint=%u",randint);
                if(lambdap[j]<rand)
                    assignment[lambdas[j]]=0;
                else
                    assignment[lambdas[j]]=1;               
            }
            //the true case
            assignment[flag]=1;
            int valuet=0;
            int index=0;
            for(int m=0; m<dim1_size; m++){
                int valueMono=1;
                for(int n=0; n<dim2_size[m]; n++){
                    if(assignment[lambdas[index+n]]==0){
                        valueMono=0;
                        index+=dim2_size[m];
                        break;
                    }
                }
                if(valueMono==1){
                    valuet=1;
                    break;
                }
            }        
            //the false case
            assignment[flag]=0;
            int valuef=0;
            index=0;
            for(int m=0; m<dim1_size; m++){
                int valueMono=1;
                for(int n=0; n<dim2_size[m]; n++){
                    if(assignment[lambdas[index+n]]==0){
                        valueMono=0;
                        index+=dim2_size[m];
                        break;
                    }
                }
                if(valueMono==1){
                    valuef=1;
                    break;
                }
            }
            sum += valuet-valuef;            
        }
        influence[flag] = 1.0*sum/count;
        printf("sum%d:%d\t", flag, sum);
    }
}

运行在 GPU 上运行代码时可能会出现什么问题？是 MWC64X 吗？根据其作者的说法，它可以在 NVIDIA GPU 上表现良好。如果是这样，我该如何解决？如果不是，可能是什么问题？

Answer 1

伪随机数的所有用例都是真实-[PARALLEL]计算平台（不是语言，平台）中的下一级挑战。

要么，
有一些随机源，这让我们陷入了一次大规模的麻烦-并行请求将以真正 [PARALLEL] 的方式得到公平处理（在这里，硬件资源可能会有所帮助，但代价是无法在同一平台上重现相同的行为 "outside"（并且时刻，如果这样的源不是具有某些种子注入功能的软件操作，则可能会设置 "just"-伪随机算法，该算法创建一个纯 [SERIAL] 序列 -产生了"just"-伪随机数 ) )

或者，
有一些"shared"-生成器的伪随机数，享受更高级别的系统级熵（这有利于产生 "quality" 的伪随机性）但以纯串行依赖为代价（不可能并行执行，串行序列得到一个服务一个接一个地按顺序进行）并且可重复运行的机会几乎为零（可重复科学的必要条件）提供可重复的相同序列，这是测试和方法验证案例所需的。

继续：

代码可能会使用工作项-"private"伪随机生成函数_{（为了两者[=47，隐私是必须的=] 并行代码执行和生成这些伪随机数的相互独立性（非干预过程））}，但每个实例都必须 a) 独立初始化，以便提供可在并行化代码中实现的预期随机性水平 -运行s 和 b) 任何 此类初始化应以可重复再现的方式执行方式，为了运行在不同的时间进行测试，经常使用不同的OpenCL目标计算平台。

对于 __kernel-s，不依赖于特定硬件的随机源，满足条件 a && b 就足以接收可重复再现的（相同）测试 "in vitro" 的结果，从而提供在通用生产级用例代码期间生成结果的合理随机方法-运行s "in vivo".

net-运行次（上面的基准测试）的比较似乎表明 Amdahl's law add-on overhead costs 加上工作原子性的尾端效应最终决定了 net-运行-time 在 XEON 上比 GPU 快 ~ 3.6x：

index1 = 17 size = 51 dim1_size = 6 sum0: 4781 influence0: 0.478100 sum2: 4781 influence2: 0.478100 sum6: 0 influence6: 0.000000 sum10: 0 influence10: 0.000000 sum12: 0 influence12: 0.000000 sum7: 0 influence7: 0.000000 sum4: 5962 influence4: 0.596200 sum8: 7971 influence8: 0.797100 sum1: 4781 influence1: 0.478100 sum3: 4781 influence3: 0.478100 sum13: 0 influence13: 0.000000 sum11: 1261 influence11: 0.126100 sum9: 0 influence9: 0.000000 sum14: 0 influence14: 0.000000 sum5: 0 influence5: 0.000000 sum15: 0 influence15: 0.000000 sum16: 0 influence16: 0.000000 Parallel influence running time: 0.054391 seconds on XEON E5-2630L v3 @ 1.80GHz using OpenCL |.... index1 = 17 |.... size = 51 |.... dim1_size = 6 |.... sum0: 10000 |.... sum1: 10000 |.... sum2: 10000 |.... sum3: 10000 |.... sum4: 10000 |.... sum5: 0 |.... sum6: 0 |.... sum7: 0 |.... sum8: 10000 |.... sum9: 0 |.... sum10: 0 |.... sum11: 0 |.... sum12: 0 |.... sum13: 0 |.... sum14: 0 |.... sum15: 0 |.... sum16: 0 |.... Parallel influence running time: 0.193581 seconds on GeForce GTX 1080 Ti using OpenCL

Answer 2

（这开始是评论，事实证明这是问题的根源，所以我把它变成了答案。）

您没有在读取 mwc64x_state_t rng; 变量之前对其进行初始化，因此任何结果都将是未定义的：

    mwc64x_state_t rng;
    for(int i=0; i<count; i++){
        for(int j=0; j<size; j++){
            uint randint=MWC64X_NextUint(&rng);

其中 MWC64X_NextUint() 在更新之前立即从 rng 状态读取：

uint MWC64X_NextUint(mwc64x_state_t *s)
{
    uint res=s->x ^ s->c;

请注意，您可能希望为每个工作项设置不同的 RNG 种子，否则您会在结果中得到令人讨厌的相关伪影。

为什么相同的 OpenCL 代码在 Intel Xeon CPU 和 NVIDIA GTX 1080 Ti GPU 上有不同的输出？

Why does the same OpenCL code have different outputs from Intel Xeon CPU and NVIDIA GTX 1080 Ti GPU?

parallel-processing

gpu

nvidia

opencl

继续：