为什么我的程序在访问 4K 偏移数组元素时很慢？

Question

我写了一个程序来获取计算机的缓存和缓存行大小，但得到的结果我无法解释，谁能帮我解释一下？

这是我的程序，access_array()以不同的步长遍历数组，我测量了这些步长的执行时间。

// Program to calculate L1 cache line size, compile in g++ -O1

#include <iostream>
#include <string>
#include <sys/time.h>
#include <cstdlib>
using namespace std;

#define ARRAY_SIZE (256 * 1024) // arbitary array size, must in 2^N to let module work

void access_array(char* arr, int steps)
{
    const int loop_cnt = 1024 * 1024 * 32; // arbitary loop count
    int idx = 0;
    for (int i = 0; i < loop_cnt; i++)
    {
        arr[idx] += 10;
        idx = (idx + steps) & (ARRAY_SIZE - 1); // if use %, the latency will be too high to see the gap
    }
}

int main(int argc, char** argv){
    double cpu_us_used;
    struct timeval  start, end;

    for(int step = 1 ; step <= ARRAY_SIZE ; step *= 2){
        char* arr = new char[ARRAY_SIZE];
        for(int i = 0 ; i < ARRAY_SIZE ; i++){
            arr[i] = 0;
        }

        gettimeofday(&start, NULL); // get start clock

        access_array(arr, step);

        gettimeofday(&end, NULL);   // get end clock

        cpu_us_used = 1000000 * (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec);

        cout << step << " , " << cpu_us_used << endl;

        delete[] arr;
    }
    return 0;
}

结果

我的问题是：

从64到512，我无法解释为什么执行时间几乎相同，为什么有一个线性从 1K 增长到 4K?

这是我的假设。

对于步骤 = 1，每 64 次迭代都会导致 1 缓存行未命中 。在 32K 次迭代后，L1 缓存已满，因此我们每 64 就会发生 L1 碰撞和容量未命中 ] 迭代次数。
For step = 64, every 1 iteration cause 1 cache line miss。在 512 次迭代之后，L1 缓存已满，因此我们每 1[=87= 就会发生 L1 碰撞和容量未命中 ] 迭代。

因此step = 32和64之间有差距

通过观察第一个间隙，我可以得出结论，L1 缓存行大小为 64 字节。

For step = 512, every 1 iteration cause 1 cache line miss。而经过64次迭代后，L1 Cache的Set 0,8,16,24,32,40,48,56已经满了，所以我们每 1 次迭代有 L1 碰撞未命中 。

For step = 4K, every 1 iteration cause 1 cache line miss。而在8次迭代后，L1 Cache的set 0已满，所以我们有L1 collision miss每 1 次迭代。

对于128到4K的情况，都发生了L1碰撞未命中，并且不同之处在于，随着步骤的增加，我们更早地开始了碰撞未命中。

我唯一能想到的是还有其他机制（可能是页面、TLB 等）会影响执行时间。

这是我工作站的缓存大小和 CPU 信息。顺便说一下，我的 PC 上也有运行这个程序，我得到了类似的结果。

平台：Intel Xeon(R) CPU E5-2667 0 @ 2.90GHz

LEVEL1_ICACHE_SIZE 32768 LEVEL1_ICACHE_ASSOC 8 LEVEL1_ICACHE_LINESIZE 64 LEVEL1_DCACHE_SIZE 32768 LEVEL1_DCACHE_ASSOC 8 LEVEL1_DCACHE_LINESIZE 64 LEVEL2_CACHE_SIZE 262144 LEVEL2_CACHE_ASSOC 8 LEVEL2_CACHE_LINESIZE 64 LEVEL3_CACHE_SIZE 15728640 LEVEL3_CACHE_ASSOC 20 LEVEL3_CACHE_LINESIZE 64 LEVEL4_CACHE_SIZE 0 LEVEL4_CACHE_ASSOC 0 LEVEL4_CACHE_LINESIZE 0

Answer 1

这个CPU可能有：

一个硬件缓存行预取器，它检测同一物理 4 KiB 页面内的线性访问模式，并在进行访问之前预取它们。这会在 4 KiB 边界处停止预取（因为物理地址可能非常不同且未知）。
硬件 TLB 预取器，检测 TLB 使用中的线性访问模式并预取 TLB 条目。

从 1 到 16，缓存行预取器正在执行它的工作，在您访问它们之前获取缓存行，因此执行时间保持不变（不受缓存未命中的影响）。

在 32，缓存行预取器开始挣扎（由于 "stop at 4 KiB page boundary" 事情）。

从 64 到 512，TLB 预取器正在执行它的工作，在您访问它们之前获取 TLB 条目，因此执行时间保持不变（不受 TLB 未命中的影响）。

从 512 到 4096，TLB 预取器未能跟上。 CPU 会在每次“4096/step”访问时等待 TLB 信息；这些停顿导致执行时间 "linear-ish" 增长。

从 4096 到 131072；我想假设 "new char[ARRAY_SIZE];" 分配了这么多 space 以至于图书馆 and/or OS 决定给你 2 MiB 页面 and/or 1 GiB 页面，消除了一些 TLB 未命中，并随着被访问的页面数量的减少而缩短了执行时间。

对于"larger than 131072"；我假设您开始看到“1 GiB 页 TLB 未命中”的影响。

请注意，从 CPUID 指令。您使用的方法更适合测量缓存延迟（从其中一个缓存获取数据需要多长时间）。

还有；为了减少 TLB 干扰，OS 可能允许您明确请求 1 GiB 页面（例如 Linux 上的 mmap(..., MAP_POPULATE | MAP_HUGE_1GB, ... )）；在开始测量之前，您可以通过 "touch then CLFLUSH" 热身循环来 "pre-warm" TLB。可以通过禁用硬件缓存行预取器。 MSR 中的标志（如果您有权限），或者可以通过使用 "random"（不可预测的）访问模式来击败。

Answer 2

终于找到答案了

我尝试将数组大小设置为 16KiB 或更小，但它的步长也很慢 = 4KiB。

另一方面，我尝试将步长的偏移量从每次迭代中乘以 2 更改为每次迭代中加 1，但当 step = 4KiB 时它仍然变慢。

代码

#define ARRAY_SIZE (4200) 

void access_array(char* arr, int steps)
{
    const int loop_cnt = 1024 * 1024 * 32; // arbitary loop count
    int idx = 0;
    for (int i = 0; i < loop_cnt; i++)
    {
        arr[idx] += 10;
        idx = idx + steps;
        if(idx >= ARRAY_SIZE)
            idx = 0;
    }

}

for(int step = 4090 ; step <= 4100 ; step ++){
    char* arr = new char[ARRAY_SIZE];
    for(int i = 0 ; i < ARRAY_SIZE ; i++){
        arr[i] = 0;
    }

    gettimeofday(&start, NULL); // get start clock

    access_array(arr, step);

    gettimeofday(&end, NULL);   // get end clock

    cpu_us_used = 1000000 * (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec);

    cout << step << " , " << cpu_us_used << endl;

    delete[] arr;
}

结果

4090 , 48385
4091 , 48497
4092 , 48136
4093 , 48520
4094 , 48090
4095 , 48278
4096 , **51818**
4097 , 48196
4098 , 48600
4099 , 48185
4100 , 63149

因此，我怀疑它与任何缓存/TLB/预取机制无关。

随着越来越多的谷歌搜索性能与幻数“4K”之间的关系，我发现英特尔平台上的4K aliasing problem会降低负载性能。

This occurs when a load is issued after a store and their memory addresses are offset by (4K). When this is processed in the pipeline, the issue of the load will match the previous store (the full address is not used at this point), so pipeline will try to forward the results of the store and avoid doing the load (this is store forwarding). Later on when the address of the load is fully resolved, it will not match the store, and so the load will have to be re-issued from a later point in the pipe. This has a 5-cycle penalty in the normal case, but could be worse in certain situations, like with un-aligned loads that span 2 cache lines.

为什么我的程序在访问 4K 偏移数组元素时很慢？

Why is my program slow when access 4K offset array element?

performance

caching

x86-64

microbenchmark

cpu-cache