read/write to large array using large loop - execution time concerns

Recently I ran into a problem that I find interesting but can't fully explain. The following code captures the nature of the problem:

#include <cstring>
#include <chrono> 
#include <iostream> 

#define NLOOPS 10

void doWorkFast(int total, int *write, int *read)
{
    for (int j = 0; j < NLOOPS; j++) {
        for (int i = 0; i < total; i++) {
            write[i] = read[i] + i;
        }
    }
}

void doWorkSlow(int total, int *write, int *read, int innerLoopSize)
{
    for (int i = 0; i < NLOOPS; i++) {
        for (int j = 0; j < total/innerLoopSize; j++) {
            for (int k = 0; k < innerLoopSize; k++) {
                write[j*k + k] = read[j*k + k] + j*k + k;
            }
        }
    }
}


int main(int argc, char *argv[])
{
    int n = 1000000000;
    
    int *heapMemoryWrite = new int[n];
    int *heapMemoryRead = new int[n];
    

    for (int i = 0; i < n; i++)
    {
        heapMemoryRead[i] = 1;
    }

    std::memset(heapMemoryWrite, 0, n * sizeof(int));   

    auto start1 = std::chrono::high_resolution_clock::now();

    doWorkFast(n,heapMemoryWrite, heapMemoryRead);
    

    auto finish1 = std::chrono::high_resolution_clock::now();  
    auto duration1 = std::chrono::duration_cast<std::chrono::microseconds>(finish1 - start1); 

    for (int i = 0; i < n; i++)
    {
        heapMemoryRead[i] = 1;
    }

    std::memset(heapMemoryWrite, 0, n * sizeof(int));

    auto start2 = std::chrono::high_resolution_clock::now();
    
    doWorkSlow(n,heapMemoryWrite, heapMemoryRead, 10);


    auto finish2 = std::chrono::high_resolution_clock::now();  
    auto duration2 = std::chrono::duration_cast<std::chrono::microseconds>(finish2 - start2); 

    // duration1 is doWorkFast (one long inner loop); duration2 is doWorkSlow (inner loop of 10).
    std::cout << "Large inner loop:" << duration1.count() << " microseconds.\n" << 
                 "Small inner loop:" << duration2.count() << " microseconds." << std::endl; 

    delete[] heapMemoryWrite;
    delete[] heapMemoryRead;
}

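Looking at the two doWork* functions, for every iteration, we are reading the same addresses adding the same value and writing to the same addresses. I understand that the doWorkSlow implementation does one or two extra operations to resolve j*k + k, but I think it's reasonably safe to assume that, relative to the time spent on the loads and stores for the memory reads and writes, the time contribution of those operations is negligible.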

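Nevertheless, doWorkSlow takes roughly twice as long as doWorkFast (25.5 seconds) on my i7-3700 using g++ --version 7.5.0. While things like cache prefetching and branch prediction come to mind, I don't have a good explanation for why doWorkFast is so much faster than doWorkSlow. Does anyone have any insight?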

Thanks

Looking at the two doWork* functions, for every iteration, we are reading the same addresses adding the same value and writing to the same addresses.

This is not true!

In doWorkFast, you index each integer incrementally, as array[i]:

array[0]
array[1]
array[2]
array[3]

In doWorkSlow, you index each integer as array[j*k + k], which jumps around and repeats.

For example, when j is 10 and you iterate k starting from 0, you are accessing:

array[0]    // 10*0+0
array[11]   // 10*1+1
array[22]   // 10*2+2
array[33]   // 10*3+3

This prevents your optimizer from using instructions that operate on many adjacent integers at once.