使用 C++ 中的 OpenMP，矩阵乘法的性能保持不变

Question

auto t1 = chrono::steady_clock::now();
     #pragma omp parallel
     {

        for(int i=0;i<n;i++)
        {
            #pragma omp for collapse(2)
            for(int j=0;j<n;j++)
            {

                for(int k=0;k<n;k++)
                {
                    C[i][j]+=A[i][k]*B[k][j];
                }

            }
        }
     }
auto t2 = chrono::steady_clock::now();

auto t = std::chrono::duration_cast<chrono::microseconds>( t2 - t1 ).count();

无论是否进行并行化，变量 t 都保持相当恒定。我不确定为什么会这样。也有一段时间 t 输出为 0。我面临的另一个问题是，如果我将 n 的值增加到 500 之类的值，编译器将无法运行程序。（这里我取 n=100）我正在使用 code::blocks 和 GNU GCC 编译器。

Answer 1

collapse 指令实际上可能对此负责，因为索引 j 是使用 divide/mod operations 重新创建的。

你试过没有collapse吗？

Answer 2

建议的 OpenMP 并行化不正确，可能会导致错误的结果。当指定 collapse(2) 时，线程执行 "simultaneously" (j,k) 次迭代。如果两个（或多个）线程在相同的 j 但不同的 k 上工作，它们会将 A[i][k]*B[k][j] 的结果累积到相同的数组位置 C[i][j]。这就是所谓的竞争条件，即 "two or more threads can access shared data and they try to change it at the same time" (What is a race condition?)。 尽管代码不是 OpenMP 有效的，但数据竞争不一定会导致错误的结果，并且会根据多种因素（调度、编译器实现、线程数等）产生错误的结果。为了解决上面代码中的问题，OpenMP 提供了 reduction 子句：

#pragma omp parallel
    {
        for(int i=0;i<n;i++)  {
            #pragma omp for collapse(2) reduction(+:C)
            for(int j=0;j<n;j++)  {
                for(int k=0;k<n;k++) {
                     C[i][j]+=A[i][k]*B[k][j];

所以 "a private copy is created in each implicit task (...) and is initialized with the initializer value of the reduction-identifier. After the end of the region, the original list item is updated with the values of the private copies using the combiner associated with the reduction-identifier" (http://www.openmp.org/wp-content/uploads/openmp-4.5.pdf). Note that the reduction on arrays in C is directly supported by the standard since OpenMP 4.5 (check if the compiler support it, otherwise there are old manual ways to achieve it, Reducing on array in OpenMp).

然而，对于给定的代码，避免最内层循环的并行化可能更合适，这样根本不需要缩减：

#pragma omp parallel
    {
        #pragma omp for collapse(2)
        for(int i=0;i<n;i++)  {
            for(int j=0;j<n;j++)  {
                for(int k=0;k<n;k++) {
                     C[i][j]+=A[i][k]*B[k][j];

对于小尺寸的矩阵 and/or 少量线程，Serial 可以比 OpenMP 版本更快。在我使用多达 16 个内核、n=1000、GNU 编译器 v6.1 的英特尔机器上，当激活 -O3 优化时，盈亏平衡点约为 4 个内核，而使用 -O0 编译时，盈亏平衡点约为 2 个内核。为了清楚起见，我报告了我测量的性能：

Serial      418020
----------- WRONG ORIG -- +REDUCTION -- OUTER.COLLAPSE -- OUTER.NOCOLLAPSE -
OpenMP-1   1924950        2841993        1450686          1455989
OpenMP-2    988743        2446098         747333           745830
OpenMP-4    515266        3182262         396524           387671
OpenMP-8    280285        5510023         219506           211913  
OpenMP-16  2227567       10807828         150277           123368

使用减少性能损失是巨大的（反向加速）。外层并行化（w or w/o collapse）是最好的选择。

关于大型矩阵的失败，可能的原因与可用堆栈的大小有关。尝试扩大系统和 OpenMP 堆栈大小，即

ulimit -s unlimited
export OMP_STACKSIZE=10000000

使用 C++ 中的 OpenMP，矩阵乘法的性能保持不变

Performance of matrix multiplications remains unchanged with OpenMP in C++

c++

codeblocks

openmp