Read/write to a large array using a large loop: execution time concerns
So I recently ran into an issue that I find interesting but can't fully explain. I've highlighted the nature of the issue in the following code:
#include <cstring>
#include <chrono>
#include <iostream>

#define NLOOPS 10

void doWorkFast(int total, int *write, int *read)
{
    for (int j = 0; j < NLOOPS; j++) {
        for (int i = 0; i < total; i++) {
            write[i] = read[i] + i;
        }
    }
}

void doWorkSlow(int total, int *write, int *read, int innerLoopSize)
{
    for (int i = 0; i < NLOOPS; i++) {
        for (int j = 0; j < total/innerLoopSize; j++) {
            for (int k = 0; k < innerLoopSize; k++) {
                write[j*k + k] = read[j*k + k] + j*k + k;
            }
        }
    }
}

int main(int argc, char *argv[])
{
    int n = 1000000000;
    int *heapMemoryWrite = new int[n];
    int *heapMemoryRead = new int[n];
    for (int i = 0; i < n; i++)
    {
        heapMemoryRead[i] = 1;
    }
    std::memset(heapMemoryWrite, 0, n * sizeof(int));
    auto start1 = std::chrono::high_resolution_clock::now();
    doWorkFast(n,heapMemoryWrite, heapMemoryRead);
    auto finish1 = std::chrono::high_resolution_clock::now();
    auto duration1 = std::chrono::duration_cast<std::chrono::microseconds>(finish1 - start1);
    for (int i = 0; i < n; i++)
    {
        heapMemoryRead[i] = 1;
    }
    std::memset(heapMemoryWrite, 0, n * sizeof(int));
    auto start2 = std::chrono::high_resolution_clock::now();
    doWorkSlow(n,heapMemoryWrite, heapMemoryRead, 10);
    auto finish2 = std::chrono::high_resolution_clock::now();
    auto duration2 = std::chrono::duration_cast<std::chrono::microseconds>(finish2 - start2);
    std::cout << "Small inner loop:" << duration1.count() << " microseconds.\n" <<
                 "Large inner loop:" << duration2.count() << " microseconds." << std::endl;
    delete[] heapMemoryWrite;
    delete[] heapMemoryRead;
}
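Looking at the two doWork* functions, for every iteration, we are reading the same addresses, adding the same value, and writing to the same addresses. I understand that in the doWorkSlow implementation we are doing one or two more operations to resolve j*k + k; however, I think it's reasonably safe to assume that, relative to the time it takes to do the loads and stores for the memory reads and writes, the time contribution of these operations is negligible.
Nevertheless, doWorkSlow takes about twice as long as doWorkFast (25.5 seconds) on an i7-3700 using g++ 7.5.0. While things like cache prefetching and branch prediction come to mind, I don't have a good explanation for why doWorkFast is so much faster than doWorkSlow. Does anyone have any insight?
Thanks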
Looking at the two doWork* functions, for every iteration, we are reading the same addresses adding the same value and writing to the same addresses.
This is not true!
In doWorkFast, you index each integer incrementally, as array[i]:
array[0]
array[1]
array[2]
array[3]
In doWorkSlow, you index each integer as array[j*k + k], which jumps around and repeats.
For example, when j is 10 and you iterate k from 0 onward, you are accessing:
array[0] // 10*0+0
array[11] // 10*1+1
array[22] // 10*2+2
array[33] // 10*3+3
This will prevent your optimizer from using instructions that can operate on many adjacent integers at once.