Why isn't my 4 thread implementation faster than the single thread one?

I don't know much about multithreading and I have no idea why this is happening, so I'll get straight to the point.

I'm processing an image: I split it into 4 parts and pass each part to a separate thread (basically, I pass the indices of the first and last pixel row of each part). For example, if the image has 1000 rows, each thread processes 250 of them. I can go into detail about my implementation and what I'm trying to achieve in case it helps. For now I'm just posting the code the threads execute, in case you can spot why this happens. I don't know if it's relevant, but in both cases (1 thread or 4 threads) the whole thing takes about 15 ms, and pfUMap and pbUMap are unordered maps.

void jacobiansThread(int start, int end,vector<float> &sJT,vector<float> &sJTJ) {

    uchar* rgbPointer;
    float* depthPointer;
    float* sdfPointer;
    float* dfdxPointer; float* dfdyPointer;
    float fov = radians(45.0);
    float aspect = 4.0 / 3.0;
    float focal = 1 / (glm::tan(fov / 2));
    float fu = focal * cols / 2 / aspect;
    float fv = focal * rows / 2;

    float strictFu = focal / aspect;
    float strictFv = focal;

    vector<float> pixelJacobi(6, 0);

    for (int y = start; y < end; y++) {
        rgbPointer = sceneImage.ptr<uchar>(y);
        depthPointer = depthBuffer.ptr<float>(y);
        dfdxPointer = dfdx.ptr<float>(y);
        dfdyPointer = dfdy.ptr<float>(y);
        sdfPointer = sdf.ptr<float>(y);
        for (int x = roiX.x; x < roiX.y; x++) {
            float deltaTerm;// = deltaPointer[x];
            float raw = sdfPointer[x];
            if (raw > 8.0) continue;
            float dirac = (1.0f / float(CV_PI)) * (1.2f / (raw * 1.44f * raw + 1.0f));
            deltaTerm = dirac;
            vec3 rgb(rgbPointer[x * 3], rgbPointer[x * 3+1], rgbPointer[x * 3+2]);
            vec3 bin = rgbToBin(rgb, numberOfBins);
            int indexOfColor = bin.x * numberOfBins * numberOfBins + bin.y * numberOfBins + bin.z;
            float s3 = glfwGetTime();
            float pF = pfUMap[indexOfColor];
            float pB = pbUMap[indexOfColor];
            float heavisideTerm;
            heavisideTerm = HEAVISIDE(raw);
            float denominator = (heavisideTerm * pF + (1 - heavisideTerm) * pB) + 0.000001;
            float commonFirstTerm = -(pF - pB) / denominator * deltaTerm;
            if (pF == pB) continue;
            vec3 pixel(x, y, depthPointer[x]);

            float dfdxTerm = dfdxPointer[x];
            float dfdyTerm = -dfdyPointer[x];

            if (pixel.z == 1) {
                cv::Point c = findClosestContourPoint(cv::Point(x, y), dfdxTerm, -dfdyTerm, abs(raw));
                if (c.x == -1) continue;
                pixel = vec3(c.x, c.y, depthBuffer.at<float>(cv::Point(c.x, c.y)));
            }

            vec3 point3D = pixel;
            pixelToViewFast(point3D, cols, rows, strictFu, strictFv);


            float Xc = point3D.x; float Xc2 = Xc * Xc;
            float Yc = point3D.y; float Yc2 = Yc * Yc;
            float Zc = point3D.z; float Zc2 = Zc * Zc;
            pixelJacobi[0] = dfdyTerm * ((fv * Yc2) / Zc2 + fv) + (dfdxTerm * fu * Xc * Yc) / Zc2;
            pixelJacobi[1] = -dfdxTerm * ((fu * Xc2) / Zc2 + fu) - (dfdyTerm * fv * Xc * Yc) / Zc2;
            pixelJacobi[2] = -(dfdyTerm * fv * Xc) / Zc + (dfdxTerm * fu * Yc) / Zc;
            pixelJacobi[3] = -(dfdxTerm * fu) / Zc;
            pixelJacobi[4] = -(dfdyTerm * fv) / Zc;
            pixelJacobi[5] = (dfdyTerm * fv * Yc) / Zc2 + (dfdxTerm * fu * Xc) / Zc2;

            float weightingTerm = -1.0 / log(denominator);
            for (int i = 0; i < 6; i++) {
                pixelJacobi[i] *= commonFirstTerm;
                sJT[i] += pixelJacobi[i];
            }
            for (int i = 0; i < 6; i++) {
                for (int j = i; j < 6; j++) {
                    sJTJ[i * 6 + j] += weightingTerm * pixelJacobi[i] * pixelJacobi[j];
                }
            }

        }
    }
}

This is the part where I start each thread:

vector<std::thread> myThreads;
float step = (roiY.y - roiY.x) / numberOfThreads;
vector<vector<float>> tsJT(numberOfThreads, vector<float>(6, 0));
vector<vector<float>> tsJTJ(numberOfThreads, vector<float>(36, 0));
for (int i = 0; i < numberOfThreads; i++) {
    int start = roiY.x + i * step;
    int end = start + step;
    if (end > roiY.y) end = roiY.y;
    myThreads.push_back(std::thread(&pwp3dV2::jacobiansThread, this, start, end, std::ref(tsJT[i]), std::ref(tsJTJ[i])));
}

vector<float> sJT(6, 0);
vector<float> sJTJ(36, 0);
for (int i = 0; i < numberOfThreads; i++) myThreads[i].join();

Other notes

To measure the time, I call glfwGetTime() before and after the second code snippet. The measurements vary, but as I mentioned, both implementations average around 15 ms.

Pure speculation, but two things could be preventing you from getting the full benefit of the parallelization.

  1. The processing is limited by the memory bus. The cores end up waiting for data to be loaded before they can continue.
  2. Data sharing between cores. Some caches are core-specific. If memory is shared between cores, the data has to travel down to the shared cache before it can be loaded (a small illustration follows this list).
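
As a purely illustrative sketch of the second point (this is not the poster's code, and since the per-thread vectors in the question are separate heap allocations it may or may not apply here): if several threads write to accumulators that end up on the same cache line, every write forces that line to bounce between cores. Giving each thread its own cache-line-aligned block avoids that:

#include <array>
#include <vector>

// 64 bytes is a typical cache-line size; adjust for the target CPU.
struct alignas(64) ThreadSums {
    std::array<float, 6>  sJT{};    // per-thread J^T sum
    std::array<float, 36> sJTJ{};   // per-thread J^T * J sum
};

std::vector<ThreadSums> perThreadSums(4);   // one padded slot per thread (e.g. numberOfThreads)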

On Linux, you can use perf to check for cache misses (for example, perf stat -e cache-references,cache-misses <your program>).

If you want better times, you need to decouple the loop from running over a counter, and for that you need to do some preprocessing. Something quick, like building an array of small header structs for each segment. Or, if you don't mind doing a bit more work, put the counter values into a vector<int> and then run for_each(std::execution::par, ...) over it (a sketch follows). Much faster.
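
To make that concrete, here is a minimal sketch of the vector<int> + for_each(std::execution::par, ...) idea. It is not the poster's code: processRow is a hypothetical stand-in for the body of the y-loop above, and it needs C++17 (with GCC/libstdc++ you also have to link against TBB):

#include <algorithm>
#include <execution>
#include <numeric>
#include <vector>

void processRow(int y);   // hypothetical per-row worker (the body of the y-loop)

void jacobiansParallel(int start, int end) {
    // Preprocessing: materialise the row indices once ...
    std::vector<int> rows(end - start);
    std::iota(rows.begin(), rows.end(), start);     // start, start + 1, ..., end - 1

    // ... then let the standard library spread the rows across threads.
    std::for_each(std::execution::par, rows.begin(), rows.end(),
                  [](int y) { processRow(y); });
}

Note that each row would still have to accumulate into its own per-thread (or per-row) sums and be reduced at the end, since writing to shared sJT/sJTJ from inside the parallel loop would be a data race.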
For timing there is:

auto t1 = std::chrono::system_clock::now();
// ... the code being timed ...
auto t2 = std::chrono::system_clock::now();
std::chrono::milliseconds f = std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1);

Starting a thread carries quite a bit of overhead; if you only have about 15 ms of work, it may well not be worth the time.

The common solution is to keep the threads running in the background and send them data when needed, instead of constructing new threads with the std::thread constructor every time there is work to do (a sketch follows below).
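
For illustration, here is a minimal sketch of that idea (my own, not taken from the poster's code): the worker threads are created once and sleep on a condition variable until work is submitted, so the per-frame cost is a notify/wait handshake rather than a thread creation.

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class WorkerPool {
public:
    explicit WorkerPool(unsigned n) {
        for (unsigned i = 0; i < n; ++i)
            workers.emplace_back([this] { run(); });
    }
    ~WorkerPool() {
        {
            std::lock_guard<std::mutex> lock(m);
            stop = true;
        }
        cv.notify_all();
        for (auto &t : workers) t.join();
    }
    void submit(std::function<void()> job) {
        {
            std::lock_guard<std::mutex> lock(m);
            jobs.push(std::move(job));
        }
        cv.notify_one();
    }
private:
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(m);
                cv.wait(lock, [this] { return stop || !jobs.empty(); });
                if (stop && jobs.empty()) return;
                job = std::move(jobs.front());
                jobs.pop();
            }
            job();   // run the submitted work outside the lock
        }
    }
    std::vector<std::thread> workers;
    std::queue<std::function<void()>> jobs;
    std::mutex m;
    std::condition_variable cv;
    bool stop = false;
};

A real implementation would also need a way to wait for the submitted jobs to finish (for example by wrapping them in std::packaged_task and keeping the std::futures), since the caller needs the per-thread sums before it can combine them.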