Fortran 上的 CUDA 矩阵乘法比 C 慢

Question

我正在使用 CUDA Fortran 和 C 执行基本的矩阵乘法，没有任何优化。 Fortran 和 C 都在做完全相同的事情，但 Fortran 的执行时间较慢。

C 内核

#define idx(x,y,z) x*y + z
__global__ void matrixMultiply(double *d_mat, double *d_matT, double *d_matSym) {

    //Global inidices
    int tx = blockIdx.y * blockDim.y + threadIdx.y;
    int ty = blockIdx.x * blockDim.x + threadIdx.x;
    int k;

    if (tx < SIZE && ty < SIZE) {
        double accum = 0.0;
        //Accumulation for (tx,ty) position
        for (k=0; k<SIZE; ++k) {
            accum += d_mat[idx(tx,SIZE,k)] * d_matT[idx(k,SIZE,ty)];
        }
        d_matSym[idx(tx,SIZE,ty)] = accum;
    }
}
//Call  
dim3 grid_dim(SIZE/32, SIZE/32, 1);
dim3 blk_dim(32, 32, 1);
matrixMultiply<<<grid_dim, blk_dim>>>(d_mat, d_matT, d_matSym);
cudaDeviceSynchronize();

Fortran 内核

attributes(global) subroutine matrixMultiply(d_mat, d_matT, d_matSym)

    integer :: tx, ty, k
    real*8 :: accum
    real*8, dimension(:,:) :: d_mat, d_matT, d_matSym
    
    tx = threadIdx%x + (blockIdx%x - 1) * blockDim%x
    ty = threadIdx%y + (blockIdx%y - 1) * blockDim%y

    if (tx <= SIZE_ .and. ty <=SIZE_) then
        accum = 0.0
        do k=1, SIZE_
            accum = accum + d_mat(tx,k) * d_matT(k,ty)
        end do
        d_matSym(tx,ty) = accum
    end if
end subroutine matrixMultiply

!Call
type(dim3) :: grid_dim, blk_dim
grid_dim = dim3(SIZE_/32, SIZE_/32, 1)
blk_dim = dim3(32, 32, 1)
call matrixMultiply<<<grid_dim, blk_dim>>>(d_mat, d_matT, d_matSym)
err = cudaDeviceSynchronize()

区别在于 C 使用一维数组，而 Fortran 使用二维数组。但这应该不是问题，因为内存下方是连续的。

如果是内存访问，那么在这两种情况下，K循环访问一个矩阵是连续的，另一个访问是按SIZE跳转的。

两者产生相同的结果。

对于 16384 x 16384 矩阵， C：5.4秒 Fortran：6.3 秒

GPU：特斯拉 V100 32GB

我不确定我做错了什么？

Answer 1

首先，我建议性能题包括完整的代码。我通常需要能够运行的东西，你可以帮我节省一些打字时间。当然，你可以把东西放在外面。当然，我大概能弄清楚它是什么。但我不太可能以这种方式帮助你，而且我怀疑我不是唯一持这种观点的人。我的建议：让其他人更容易帮助你。我在下面给出了有用的示例。

关于问题：

The difference is that C uses a 1D array whereas Fortran uses 2D. But that should not be a problem since underneath the memory will be contiguous.

TL;DR：您的说法（“这不应该成为问题”）显然不能成立。一维分配和二维分配之间的区别很重要，不仅从存储的角度，而且从索引计算的角度。如果您对这个答案的长度很敏感，请跳到此 post.

底部的注释 D

详情：

当我们有这样的循环时：

    for (k=0; k<SIZE; ++k) {
        accum += d_mat[idx(tx,SIZE,k)] * d_matT[idx(k,SIZE,ty)];
    }

如果编译器发现索引计算可以随着循环的进行而简单地“递增”，这将相当简单。 OTOH，如果必须在考虑以前在每次循环迭代中引用的 2D 计算（例如 row*width+column）的情况下解决索引，这将更加复杂。这里的一个重要因素也是数组 width 是否在编译时被编译器 usable/used

这似乎是正在发生的事情，我认为这是时间上的差异。我们将为此考虑 2 个证明点。

SASS分析

如果我们比较实现之间的 SASS，我们会发现一些重要的差异。为此，我们需要完整的示例：

CUDA Fortran:

$ cat t11.cuf
module mmk
   USE cudafor
   !
   ! Definition of symbols for real types (RP)
   !
   IMPLICIT NONE
   !
   INTEGER, PARAMETER :: SP = SELECTED_REAL_KIND(6,   37)     ! REAL32
   INTEGER, PARAMETER :: DP = SELECTED_REAL_KIND(15, 307)     ! REAL64
   INTEGER, PARAMETER :: SIZE_ = 16384
   !

   Contains

        attributes(global) subroutine matrixMultiply(d_mat, d_matT, d_matSym)

          integer :: tx, ty, k
          REAL(DP) :: accum
          REAL(DP), dimension(:,:) :: d_mat, d_matT, d_matSym

          tx = threadIdx%x + (blockIdx%x - 1) * blockDim%x
          ty = threadIdx%y + (blockIdx%y - 1) * blockDim%y

          if (tx <= SIZE_ .and. ty <=SIZE_) then
            accum = 0.0
            do k=1, SIZE_
              accum = accum + d_mat(tx,k) * d_matT(k,ty)
            end do
            d_matSym(tx,ty) = accum
          end if
        end subroutine matrixMultiply


end module mmk

PROGRAM Test
   !
   ! This is the main program for Test
   !
   USE cudafor
   USE mmk

   !
   IMPLICIT NONE
   !
   REAL(DP), ALLOCATABLE, DEVICE, DIMENSION(:,:)   :: d_mat, d_matT, d_matSym
   !
   INTEGER    :: err, i1, i2
   type(dim3) :: grid_dim, blk_dim
   !
   !
   ! Allocate storage for the arrays
   !
   Allocate(d_mat(SIZE_,SIZE_),d_matT(SIZE_,SIZE_),d_matSym(SIZE_,SIZE_))
   !
   ! invoke the kernel
   !

   !Call
   grid_dim = dim3(SIZE_/32, SIZE_/32, 1)
   blk_dim = dim3(32, 32, 1)
   call matrixMultiply<<<grid_dim, blk_dim>>>(d_mat, d_matT, d_matSym)
   call matrixMultiply<<<grid_dim, blk_dim>>>(d_mat, d_matT, d_matSym)
   err = cudaDeviceSynchronize()


   !
   ! Free storage for the arrays
   !
   Deallocate(d_mat,d_matT,d_matSym)
   !
END PROGRAM Test
$ nvfortran t11.cuf -o t11
$ nvprof ./t11
==38508== NVPROF is profiling process 38508, command: ./t11
==38508== Profiling application: ./t11
==38508== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  12.7168s         2  6.35838s  6.35229s  6.36447s  mmk_matrixmultiply_
                    0.00%  9.2480us         6  1.5410us  1.3120us  2.1760us  [CUDA memcpy HtoD]
      API calls:   97.08%  12.7260s         9  1.41400s  10.596us  6.36561s  cudaFree
                    2.85%  374.24ms         9  41.583ms  3.5780us  369.55ms  cudaMalloc
                    0.03%  4.5190ms         4  1.1297ms  1.0551ms  1.2608ms  cuDeviceTotalMem
                    0.02%  2.5060ms       404  6.2020us     111ns  1.2057ms  cuDeviceGetAttribute
                    0.01%  1.1501ms         4  287.52us  33.014us  1.0465ms  cuDeviceGetName
                    0.00%  120.00us         6  20.000us  5.1530us  62.523us  cudaMemcpy
                    0.00%  99.593us         2  49.796us  44.275us  55.318us  cudaLaunchKernel
                    0.00%  43.414us         1  43.414us  43.414us  43.414us  cudaDeviceSynchronize
                    0.00%  11.563us         4  2.8900us  1.4400us  5.5870us  cuDeviceGetPCIBusId
                    0.00%  1.7030us         8     212ns     124ns     580ns  cuDeviceGet
                    0.00%     901ns         3     300ns     163ns     534ns  cuDeviceGetCount
                    0.00%     855ns         4     213ns     183ns     263ns  cuDeviceGetUuid
$

NVIDIA HPC SDK，2021.2，Tesla V100

我们看到执行持续时间约为 6.3 秒，与您的报告一致。

CUDA C++:

$ cat t12.cu
#define idx(x,y,z) x*y + z
const int SIZE = 16384;
__global__ void matrixMultiply(double *d_mat, double *d_matT, double *d_matSym) {

    //Global inidices
    int tx = blockIdx.y * blockDim.y + threadIdx.y;
    int ty = blockIdx.x * blockDim.x + threadIdx.x;
    int k;

    if (tx < SIZE && ty < SIZE) {
        double accum = 0.0;
        //Accumulation for (tx,ty) position
        for (k=0; k<SIZE; ++k) {
            accum += d_mat[idx(tx,SIZE,k)] * d_matT[idx(k,SIZE,ty)];
        }
        d_matSym[idx(tx,SIZE,ty)] = accum;
    }
}
int main(){
  //Call
  dim3 grid_dim(SIZE/32, SIZE/32, 1);
  dim3 blk_dim(32, 32, 1);
  double *d_mat, *d_matT, *d_matSym;
  cudaMalloc(&d_mat,    SIZE*SIZE*sizeof(double));
  cudaMalloc(&d_matT,   SIZE*SIZE*sizeof(double));
  cudaMalloc(&d_matSym, SIZE*SIZE*sizeof(double));
  matrixMultiply<<<grid_dim, blk_dim>>>(d_mat, d_matT, d_matSym);
  matrixMultiply<<<grid_dim, blk_dim>>>(d_mat, d_matT, d_matSym);
  cudaDeviceSynchronize();
}
$ nvcc -o t12 t12.cu -arch=sm_70
$ nvprof ./t12
==40364== NVPROF is profiling process 40364, command: ./t12
==40364== Profiling application: ./t12
==40364== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  11.0379s         2  5.51893s  5.50503s  5.53282s  matrixMultiply(double*, double*, double*)
      API calls:   97.80%  11.0379s         1  11.0379s  11.0379s  11.0379s  cudaDeviceSynchronize
                    2.15%  242.74ms         3  80.914ms  1.9747ms  238.78ms  cudaMalloc
                    0.04%  4.3105ms         4  1.0776ms  1.0728ms  1.0890ms  cuDeviceTotalMem
                    0.01%  1.4584ms       404  3.6090us     110ns  163.21us  cuDeviceGetAttribute
                    0.00%  142.28us         4  35.571us  32.475us  41.980us  cuDeviceGetName
                    0.00%  51.695us         2  25.847us  7.0790us  44.616us  cudaLaunchKernel
                    0.00%  11.198us         4  2.7990us  1.6260us  5.5950us  cuDeviceGetPCIBusId
                    0.00%  1.7930us         3     597ns     149ns  1.2200us  cuDeviceGetCount
                    0.00%  1.6590us         8     207ns     120ns     704ns  cuDeviceGet
                    0.00%     733ns         4     183ns     161ns     215ns  cuDeviceGetUuid
$

CUDA 11.2，相同的 V100 机器

我们再次看到约 5.5 秒的内核持续时间，与您的报告一致。现在让我们比较一下SASS。我们使用 cubojdump utility 来提取它，在这两种情况下，很明显编译器正在以某种方式展开内核循环。关键操作是DFMA，双精度融合乘加，每次循环迭代调用一次，计算单个产品（并保持运行宁和）。因此，在展开的循环中，我将在每种情况下仅提取一小段 DFMA 操作，以了解每 FLOP 的“成本”和“开销”。

CUDA Fortran:

/*1670*/                   LDG.E.64.SYS R18, [R12+0x1b8] ;                 /* 0x0001b8000c127381 */
                                                                           /* 0x000ea200001eeb00 */
/*1680*/                   IADD3 R20, P0, R16, R29, RZ ;                   /* 0x0000001d10147210 */
                                                                           /* 0x001fc60007f1e0ff */
/*1690*/                   LDG.E.64.SYS R22, [R12+0x1b0] ;                 /* 0x0001b0000c167381 */
                                                                           /* 0x000e6400001eeb00 */
/*16a0*/                   IMAD.X R21, R17, 0x1, R27, P0 ;                 /* 0x0000000111157824 */
                                                                           /* 0x000fe400000e061b */
/*16b0*/                   LDG.E.64.SYS R24, [R16] ;                       /* 0x0000000010187381 */
                                                                           /* 0x00006c00001eeb00 */
/*16c0*/                   LDG.E.64.SYS R16, [R20] ;                       /* 0x0000000014107381 */
                                                                           /* 0x001ea200001eeb00 */
/*16d0*/                   DFMA R22, R24, R22, R14 ;                       /* 0x000000161816722b */
                                                                           /* 0x002084000000000e */
/*16e0*/                   IADD3 R14, P0, R20, R29, RZ ;                   /* 0x0000001d140e7210 */
                                                                           /* 0x001fca0007f1e0ff */
/*16f0*/                   IMAD.X R15, R21, 0x1, R27, P0 ;                 /* 0x00000001150f7824 */
                                                                           /* 0x000fe200000e061b */
/*1700*/                   DFMA R16, R16, R18, R22 ;                       /* 0x000000121010722b */

上述序列在展开的循环区域内重复。我们注意到有 2 个 DFMA 操作，4 个 LDG 操作（有意义 - 每个乘法两个）和剩余的指令 (4) 可能构成索引更新，因为循环通过其迭代进行。

CUDA C++:

/*02d0*/                   LDG.E.64.SYS R16, [R6+0x40000] ;         /* 0x0400000006107381 */
                                                                    /* 0x001f2800001eeb00 */
/*02e0*/                   LDG.E.64.SYS R24, [R4+0x10] ;            /* 0x0000100004187381 */
                                                                    /* 0x000f2200001eeb00 */
/*02f0*/                   DFMA R10, R20, R10, R22 ;                /* 0x0000000a140a722b */
                                                                    /* 0x0200860000000016 */
/*0300*/                   LDG.E.64.SYS R22, [R6+0x60000] ;         /* 0x0600000006167381 */
                                                                    /* 0x001f6800001eeb00 */
/*0310*/                   LDG.E.64.SYS R20, [R4+0x18] ;            /* 0x0000180004147381 */
                                                                    /* 0x000f6200001eeb00 */
/*0320*/                   DFMA R14, R8, R14, R10 ;                 /* 0x0000000e080e722b */

再次重复该序列。我们看到 2 DFMA 和 4 LDG 指令，但在重复序列中没有其他指令。我们注意到 LDG 指令似乎具有可以在编译时合并到指令中的偏移量，并且这些偏移量消除了对任何额外指令的需要，因为它们只是“递增”先前计算的偏移量（我们也请注意，“增量”在一种情况下似乎是按列偏移，而在另一种情况下是按行偏移，就像我们期望在内核循环中获取乘法操作数一样。)

重构

我们能否重构 Fortran 代码，使其像 C++ 代码一样工作？是：

$ cat t11a.cuf
module mmk
   USE cudafor
   !
   ! Definition of symbols for real types (RP)
   !
   IMPLICIT NONE
   !
   INTEGER, PARAMETER :: SP = SELECTED_REAL_KIND(6,   37)     ! REAL32
   INTEGER, PARAMETER :: DP = SELECTED_REAL_KIND(15, 307)     ! REAL64
   INTEGER, PARAMETER :: SIZE_ = 16384
   !

   Contains

        attributes(global) subroutine matrixMultiply(d_mat, d_matT, d_matSym)

          integer :: tx, ty, k
          REAL(DP) :: accum
          REAL(DP), dimension(:) :: d_mat, d_matT, d_matSym

          tx = threadIdx%x + (blockIdx%x - 1) * blockDim%x
          ty = threadIdx%y + (blockIdx%y - 1) * blockDim%y

          if (tx <= SIZE_ .and. ty <=SIZE_) then
            accum = 0.0
            do k=1, SIZE_
              accum = accum + d_mat((ty-1)*SIZE_+k) * d_matT((k-1)*SIZE_+tx)
            end do
            d_matSym((ty-1)*SIZE_+tx) = accum
          end if
        end subroutine matrixMultiply


end module mmk

PROGRAM Test
   !
   ! This is the main program for Test
   !
   USE cudafor
   USE mmk

   !
   IMPLICIT NONE
   !
   REAL(DP), ALLOCATABLE, DEVICE, DIMENSION(:)     :: d_mat, d_matT, d_matSym
   !
   INTEGER    :: err, i1, i2
   type(dim3) :: grid_dim, blk_dim
   !
   ! Allocate storage for the arrays
   !
   Allocate(d_mat(SIZE_*SIZE_),d_matT(SIZE_*SIZE_),d_matSym(SIZE_*SIZE_))
   !
   ! invoke the kernel
   !

   !Call
   grid_dim = dim3(SIZE_/32, SIZE_/32, 1)
   blk_dim = dim3(32, 32, 1)
   call matrixMultiply<<<grid_dim, blk_dim>>>(d_mat, d_matT, d_matSym)
   call matrixMultiply<<<grid_dim, blk_dim>>>(d_mat, d_matT, d_matSym)
   err = cudaDeviceSynchronize()


   !
   ! Free storage for the arrays
   !
   Deallocate(d_mat,d_matT,d_matSym)
   !
END PROGRAM Test
$ nvfortran t11a.cuf -o t11a
$ nvprof ./t11a
==45544== NVPROF is profiling process 45544, command: ./t11a
==45544== Profiling application: ./t11a
==45544== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  10.8169s         2  5.40847s  5.39118s  5.42576s  mmk_matrixmultiply_
                    0.00%  9.2160us         6  1.5360us  1.3120us  2.1440us  [CUDA memcpy HtoD]
      API calls:   96.72%  10.8240s         9  1.20266s  12.102us  5.42594s  cudaFree
                    3.22%  360.34ms         9  40.038ms  3.2760us  355.64ms  cudaMalloc
                    0.04%  4.3488ms         4  1.0872ms  1.0598ms  1.1593ms  cuDeviceTotalMem
                    0.02%  2.3633ms       404  5.8490us     111ns  869.22us  cuDeviceGetAttribute
                    0.00%  144.50us         4  36.125us  33.254us  41.317us  cuDeviceGetName
                    0.00%  106.43us         6  17.737us  4.7910us  51.453us  cudaMemcpy
                    0.00%  101.46us         2  50.732us  44.479us  56.985us  cudaLaunchKernel
                    0.00%  20.926us         1  20.926us  20.926us  20.926us  cudaDeviceSynchronize
                    0.00%  7.2850us         4  1.8210us     491ns  4.6210us  cuDeviceGetPCIBusId
                    0.00%  1.7650us         8     220ns     118ns     672ns  cuDeviceGet
                    0.00%     926ns         4     231ns     218ns     264ns  cuDeviceGetUuid
                    0.00%     614ns         3     204ns     130ns     310ns  cuDeviceGetCount
$

而且我们看到性能与 CUDA C++ 版本大致相同。

备注：

一个。我省略了最终变体的 SASS 分析。练习就留给了reader。然而，展开的循环部分是 LDG-LDG-DFMA 的重复序列，正如在 CUDA C++ 版本中看到的那样。

乙。你可能有两个问题：1.为什么会这样？我试图回答这个问题。 2. 应该是这个 way/does 需要这样吗？我没有回答这个问题，我相信这是一些我不想做的 Fortran 语句的组合，也是编译器工程师的一个问题。如果您认为 2D 实现应该复制 1D 实现性能，我建议 filing a bug。

C。在我注意到你在 C++ 的情况下而不是 fortran 的情况下这样做之前，我有点头疼：

int tx = blockIdx.y * blockDim.y + threadIdx.y;
     ^            ^            ^             ^

D.我认为这种差异在编译时无法轻松解决的原因之一是 CUDA Fortran 中的编译器 2D 数组处理（无论是否需要）似乎涉及创建运行时间元数据包其中包括数组宽度，可能还有其他信息。这个元数据包与阵列数据有效负载分开传输到设备 - 这可以从上面的分析器输出中推测出来，或者通过运行 --print-gpu-trace 更清楚地看到它。在您的 CUDA C++ 案例中，实际上，数组宽度 (SIZE) 在编译时已知。在 Fortran 的情况下，它不是（或者，如果是，编译器似乎没有通过内核代码生成传播这种知识）。您可以看到上面的“改进”寻址模式建议在编译时已知的偏移量。这个“应该”是否可以通过对 Fortran 进行更仔细的处理来解决，我不能说。但是我希望某种 contiguous declaration/decoration、本身不足以解决编译器的这种差异。该声明本身并不能消除对基于数组宽度的动态运行时间索引计算的需要。

Fortran 上的 CUDA 矩阵乘法比 C 慢

CUDA Matrix Multiply on Fortran is slower than C

fortran

cuda