CUDA templated function call

Question

This runtime error has bothered me for two days. I have tried every debugging approach I could think of and still cannot find the problem.

/* Wraps a CUDA runtime call and reports any failure on stderr. */
#define CUDA_RT_CALL( call ) {                                              \
  cudaError_t cudaStatus = call;                                            \
  if ( cudaSuccess != cudaStatus )                                          \
    fprintf(stderr, "ERROR: CUDA RT call \"%s\" in line %d of file %s failed with %s (%d).\n", \
            #call, __LINE__, __FILE__, cudaGetErrorString(cudaStatus), cudaStatus); \
}

template <typename Tkey, typename Tvalue>
void KernelDriver(Tkey * K, Tvalue * V, int N, long * h_h, long * h_p, int A){
  Tkey * d_keys_in;
  Tvalue * d_values_in;
  CUDA_RT_CALL(cudaMalloc((void**)&d_keys_in, sizeof(Tkey)*N));
  CUDA_RT_CALL(cudaMalloc((void**)&d_values_in, sizeof(Tvalue)*N));
  CUDA_RT_CALL(cudaMemcpy(d_keys_in, K, sizeof(Tkey)*N, cudaMemcpyHostToDevice));
  CUDA_RT_CALL(cudaMemcpy(d_values_in, V, sizeof(Tvalue)*N, cudaMemcpyHostToDevice));

  /* myKernel() */
}

The code above compiles fine. However, when I run the compiled CUDA program with an int-long key-value pair, e.g.,

KernelDriver<int, long>((int *)key, (long *)value, n, h_histo, h_prefix, agg);

the CUDA runtime API reports an error:

ERROR: CUDA RT call "cudaMemcpy(d_values_in, V, sizeof(Tvalue)*N, cudaMemcpyHostToDevice)" in line 295 of file gpucode.cu failed with invalid argument (11).

By contrast, when the key-value pair is int-double, e.g.,

KernelDriver<int, double>((int *)key, (double *)value, n, h_histo, h_prefix, agg);

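there is no error at all and everything runs fine. I printed sizeof(long) on both the host and the device, and both report 8 bytes. At this point I have no idea what the problem is.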

Answer 1

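I just found the solution myself. The size of "long" differs between machines: on some it is 4 bytes, on others it is 8 bytes. Make sure both sides are compiled with the same compiler and for the same architecture so that they agree on the size; otherwise cudaMemcpy cannot copy between two memory blocks whose sizes differ.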
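One way to rule this class of problem out is to avoid "long" altogether and use a fixed-width type. The following is only a sketch under assumptions, not the code from the original post: the names Value, copy_bytes and buffer_matches are hypothetical helpers introduced for illustration, and it assumes the host data is really stored as 64-bit integers. It is plain host-side C++, so it should also compile inside a .cu file.

```cpp
#include <cstddef>
#include <cstdint>

// Fixed-width value type: 8 bytes wherever it is available, unlike `long`,
// which is 4 bytes on 64-bit Windows (LLP64) and 8 bytes on 64-bit Linux (LP64).
using Value = std::int64_t;

// Fail at compile time if the value type is not the width the data was produced with.
static_assert(sizeof(Value) == 8, "Value must be 8 bytes");

// Hypothetical helper: number of bytes a cudaMemcpy of `count` elements of T would transfer.
template <typename T>
constexpr std::size_t copy_bytes(std::size_t count) {
    return sizeof(T) * count;
}

// Hypothetical helper: true if a host buffer of `host_bytes` bytes holds
// exactly `count` elements of T, i.e. the copy would not overrun the buffer.
template <typename T>
constexpr bool buffer_matches(std::size_t host_bytes, std::size_t count) {
    return host_bytes == copy_bytes<T>(count);
}
```

With such an alias, the call from the question would read KernelDriver<int, Value>(...), and buffer_matches<Value>(bytes_allocated_on_host, n) could be checked before the cudaMemcpy, where bytes_allocated_on_host stands for whatever size the host allocation actually used.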
