Memory latency measurement with time stamp counter

Question

I wrote the following code, which first flushes two array elements from the cache and then reads them back to measure the hit/miss latencies.

#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>
#include <time.h>
int main()
{
    /* create array */
    int array[ 100 ];
    int i;
    for ( i = 0; i < 100; i++ )
        array[ i ] = i;   // bring array to the cache

    uint64_t t1, t2, ov, diff1, diff2, diff3;

    /* flush the cache lines holding array[30] and array[70] */
    _mm_lfence();
    _mm_clflush( &array[ 30 ] );
    _mm_clflush( &array[ 70 ] );
    _mm_lfence();

    /* READ MISS 1 */
    _mm_lfence();           // fence to keep load order
    t1 = __rdtsc();         // set start time
    _mm_lfence();
    int tmp = array[ 30 ];   // read the first element => cache miss
    _mm_lfence();
    t2 = __rdtsc();         // set stop time
    _mm_lfence();

    diff1 = t2 - t1;        // two fence statements are overhead
    printf( "tmp is %d\ndiff1 is %lu\n", tmp, diff1 );

    /* READ MISS 2 */
    _mm_lfence();           // fence to keep load order
    t1 = __rdtsc();         // set start time
    _mm_lfence();
    tmp = array[ 70 ];      // read the second element => cache miss (or hit due to prefetching?!)
    _mm_lfence();
    t2 = __rdtsc();         // set stop time
    _mm_lfence();

    diff2 = t2 - t1;        // two fence statements are overhead
    printf( "tmp is %d\ndiff2 is %lu\n", tmp, diff2 );


    /* READ HIT */
    _mm_lfence();           // fence to keep load order
    t1 = __rdtsc();         // set start time
    _mm_lfence();
    tmp = array[ 30 ];      // read the first element again => cache hit
    _mm_lfence();
    t2 = __rdtsc();         // set stop time
    _mm_lfence();

    diff3 = t2 - t1;        // two fence statements are overhead
    printf( "tmp is %d\ndiff3 is %lu\n", tmp, diff3 );


    /* measuring fence overhead */
    _mm_lfence();
    t1 = __rdtsc();
    _mm_lfence();
    _mm_lfence();
    t2 = __rdtsc();
    _mm_lfence();
    ov = t2 - t1;

    printf( "lfence overhead is %lu\n", ov );
    printf( "cache miss1 TSC is %lu\n", diff1-ov );
    printf( "cache miss2 (or hit due to prefetching) TSC is %lu\n", diff2-ov );
    printf( "cache hit TSC is %lu\n", diff3-ov );


    return 0;
}

The output is

# gcc -O3 -o simple_flush simple_flush.c
# taskset -c 0 ./simple_flush
tmp is 30
diff1 is 529
tmp is 70
diff2 is 222
tmp is 30
diff3 is 46
lfence overhead is 32
cache miss1 TSC is 497
cache miss2 (or hit due to prefetching) TSC is 190
cache hit TSC is 14
# taskset -c 0 ./simple_flush
tmp is 30
diff1 is 486
tmp is 70
diff2 is 276
tmp is 30
diff3 is 46
lfence overhead is 32
cache miss1 TSC is 454
cache miss2 (or hit due to prefetching) TSC is 244
cache hit TSC is 14
# taskset -c 0 ./simple_flush
tmp is 30
diff1 is 848
tmp is 70
diff2 is 222
tmp is 30
diff3 is 46
lfence overhead is 34
cache miss1 TSC is 814
cache miss2 (or hit due to prefetching) TSC is 188
cache hit TSC is 12

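There are some problems with the output for reading array[70]. The TSC value is neither that of a hit nor that of a miss. I flushed that element in the same way as array[30]. One possibility is that when array[30] is accessed, the HW prefetcher brings in array[70], so the second read should be a hit. However, the TSC value is much larger than for a hit. When I read array[30] for the second time, you can see that a hit costs no more than about 20 TSC ticks.

Conversely, if array[70] is not prefetched, the TSC value should be similar to that of a cache miss.

Is there any reason for that?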

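Update 1:

To read the array element, I tried (void) *((int*)array+i) as suggested by Peter and Hadi.

In the output I see many negative results. That is, the measured overhead seems to be larger than the time taken by (void) *((int*)array+i).

Update 2:

I forgot to add volatile. The results are meaningful now.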

Answer 1

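Some thoughts:

Perhaps a[70] was prefetched into some level of cache other than L1?
Perhaps some optimization in DRAM makes this access fast, e.g. the row buffer may be left open after the access to a[30].

You should investigate accesses other than a[30] and a[70] to see whether you get different numbers. For example, do you get the same timing for a hit on a[30] followed by a[31] (which should be fetched in the same line as a[30] if you use aligned_alloc with 64-byte alignment)? And do other elements such as a[69] and a[71] give the same timings as a[70]?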

Answer 2

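First, note that the two calls to printf after measuring diff1 and diff2 may perturb the state of the L1D and even the L2. On my system, with printf, the reported values of diff3-ov range from 4 to 48 cycles (I've configured my system so that the TSC frequency is about equal to the core frequency). The most common values are those of the L2 and L3 latencies. If the reported value is 8, then we got our L1D cache hit. If it is larger than 8, then most probably the preceding call to printf has evicted the target cache line from the L1D and possibly the L2 (and in some rare cases, the L3!), which would explain the measured latencies that are higher than 8. @PeterCordes suggested using (void) *((volatile int*)array + i) instead of temp = array[i]; printf(temp). After making this change, my experiments show that most reported measurements of diff3-ov are exactly 8 cycles (which suggests that the measurement error is about 4 cycles), and the only other values that get reported are 0, 4, and 12. So Peter's approach is strongly recommended.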

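In general, the main memory access latency depends on many factors, including the state of the MMU caches and the impact of the page table walkers on the data caches, the core frequency, the uncore frequency, the state and configuration of the memory controller and the memory chips with respect to the target physical address, uncore contention, and on-core contention due to hyperthreading. array[70] might be in a different virtual page (and physical page) than array[30], and the IPs of their load instructions and the addresses of the target memory locations may interact with the prefetchers in complex ways. So there can be many reasons why cache miss1 differs from cache miss2. A thorough investigation is possible, but as you might imagine it would require a lot of effort. Generally, if your core frequency is larger than 1.5 GHz (which is smaller than the TSC frequency on high-performance Intel processors), then an L3 load miss will take at least 60 core cycles. In your case, both miss latencies are over 100 cycles, so these are most likely L3 misses. In some extremely rare cases, though, cache miss2 seems to be close to the L3 or L2 latency ranges, which would be due to prefetching.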

I've determined that the following code gives a statistically more accurate measurement on Haswell:

t1 = __rdtscp(&dummy);
tmp = *((volatile int*)array + 30);
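asm volatile ("add $1, %0\n\t"   // chain of dependent adds on the loaded value;
              "add $1, %0\n\t"   // the immediate itself does not matter
              "add $1, %0\n\t"
              "add $1, %0\n\t"
              "add $1, %0\n\t"
              "add $1, %0\n\t"
              "add $1, %0\n\t"
              "add $1, %0\n\t"
              "add $1, %0\n\t"
              "add $1, %0\n\t"
              "add $1, %0\n\t"
          : "+r" (tmp));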
t2 = __rdtscp(&dummy);
t2 = __rdtscp(&dummy);
loadlatency = t2 - t1 - 60; // 60 is the overhead

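The probability that loadlatency is 4 cycles is 97%. The probability that loadlatency is 8 cycles is 1.7%. The probability that loadlatency takes any other value is 1.3%. All of the other values are larger than 8 and are multiples of 4. I'll try to add an explanation later.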
