Measuring cycles per byte for an algorithm

Question

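I have written a bitsliced implementation of the PRIMATEs cipher (http://primates.ae/) in C. I implemented it with SIMD programming, so my code uses the AVX2 instruction set.

I am currently trying to measure accurately how well my implementation performs, but I don't really trust the numbers I am getting. With my current numbers I get around 200 cycles per byte, which seems better than what other ciphers get.

Currently, my code looks like this: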

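#include <stdlib.h>     // calloc, qsort
#include <windows.h>    // SetThreadAffinityMask, GetCurrentThread, Sleep
#include <x86intrin.h>  // _rdtsc() with GCC; MSVC spells it __rdtsc() in <intrin.h>

typedef unsigned long long u64;

// Comparison callback for qsort: ascending order of u64 values.
static int cmp_u64(const void *a, const void *b) {
    u64 x = *(const u64 *)a, y = *(const u64 *)b;
    return (x > y) - (x < y);
}

u64 start, finish, cpu_frequency;
static u64 samples[1000000];  // static so the 8 MB array is not placed on the stack
unsigned char *data = calloc(4000, sizeof(unsigned char));

//Performance test on a single core, as that is the standard when computing cycles/byte.
SetThreadAffinityMask(GetCurrentThread(), 0x00000008);

//Find CPU clock speed (TSC ticks counted during one second)
start = _rdtsc();
Sleep(1000);  // Win32 Sleep takes milliseconds
finish = _rdtsc();
cpu_frequency = finish - start;

//Take a lot of samples and use median of these.
for (int i = 0; i < 1000000; i++) {
    start = _rdtsc();
    encrypt(data);
    decrypt(data);
    finish = _rdtsc();
    samples[i] = finish - start;
}
qsort(samples, 1000000, sizeof(u64), cmp_u64);
u64 median = samples[500000];
double cycles_per_byte = (double)median / 4000.0;  // same value as 1 / (4000.00 / median)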

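I believe my calculations are correct, so I am wondering...

Is it wrong to use _rdtsc() for measuring cycles per byte?
Could the reason be that I don't measure the clock cycles spent exclusively on my code, but those spent on the system as a whole? (I don't know whether I can see how many cycles my code used exclusively in that case.)
Could the fact that I run it on Windows instead of e.g. Linux make a huge difference?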

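I have tried compiling the code with both GCC and MSVC, with no real difference (GCC is about 1% faster with -O2 or -O3; I can't remember which). I run the test on a single core only, and I have turned off Intel Turbo Boost and Hyper-Threading.

My full source code is here: https://github.com/opolo/Bitsliced-AEAD/tree/master/Primates/APE120_Bitsliced My test suite is in Ref.c, while the bitsliced permutation is in Primate.c... The code is not super clean right now, my bad. That is why I tried to give an example above rather than copy-pasting my code in full.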

Answer 1

Is it wrong to use _rdtsc() for measuring cycles per byte?

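No, it is the correct approach. I prefer to use inline assembly for the rdtsc instruction to guarantee inlining. The intrinsic is an implementation-dependent function, so you don't really know what is going on. In particular, you don't know whether it correctly prevents out-of-order execution. See here for an inline asm solution. I don't know what the x86 intrinsic does.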
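To make the out-of-order concern concrete, here is a minimal sketch of a fenced TSC read in GCC/Clang inline assembly for x86. It is not necessarily what the linked solution does. The name read_tsc_fenced is made up for illustration, and the sketch assumes LFENCE behaves as Intel documents it: placed before RDTSC it makes the read wait for earlier instructions, and placed after it keeps later instructions from starting before the read.

```c
#include <stdint.h>

/* Hypothetical helper (not from the question): read the time-stamp counter
   with an LFENCE on each side so the read is ordered against the code
   being timed. GCC/Clang syntax, x86 or x86-64 only. */
static inline uint64_t read_tsc_fenced(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__(
        "lfence\n\t"   /* wait for earlier instructions to complete */
        "rdtsc\n\t"    /* EDX:EAX = time-stamp counter */
        "lfence"       /* later instructions start only after the read */
        : "=a"(lo), "=d"(hi)
        :
        : "memory");   /* also stop the compiler from moving memory accesses across */
    return ((uint64_t)hi << 32) | lo;
}
```

Calling it once before and once after the timed region and subtracting the two values gives the elapsed TSC ticks, in the same way as the _rdtsc() calls in the question.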

Could the reason be that I don't measure clock cycle spent exclusively on my code, but on the system as a whole?

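Yes, the function calls add some overhead. On modern platforms it is typically on the order of 100 clock ticks. If your data set is large enough, it should not matter.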
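As a rough illustration of why the overhead is negligible here, the sketch below computes a median and turns it into cycles per byte after subtracting a fixed overhead. The helper names median_u64 and cycles_per_byte are hypothetical, and the overhead value is something you would have to measure yourself, for example by timing an empty region.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* qsort callback: ascending order of uint64_t values. */
static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Sorts the samples in place and returns the middle element
   (the upper median when n is even). Requires n > 0. */
static uint64_t median_u64(uint64_t *samples, size_t n)
{
    qsort(samples, n, sizeof samples[0], cmp_u64);
    return samples[n / 2];
}

/* Cycles per byte after removing a fixed measurement overhead.
   bytes_processed is the number of bytes handled inside one timed region. */
static double cycles_per_byte(uint64_t median_cycles,
                              uint64_t overhead_cycles,
                              size_t bytes_processed)
{
    uint64_t net = median_cycles > overhead_cycles
                       ? median_cycles - overhead_cycles
                       : 0;
    return (double)net / (double)bytes_processed;
}
```

With the numbers from the question, 200 cycles per byte over 4000 bytes is about 800,000 cycles per sample, so an overhead of 100 cycles changes the result by roughly 0.0125%. Note that the timed region in the question covers both encrypt and decrypt; if both passes over the 4000-byte buffer are counted as processed data, bytes_processed would be 8000 rather than 4000.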

Could the fact that I run it on Windows instead of e.g. Linux make a huge difference?

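No.

So you are not getting the performance you wanted from the algorithm? That all depends on how you implemented it, so I would not blame your timing function. There are many subtle issues in perfecting an implementation of an algorithm. If you vectorize things explicitly with inline asm or intrinsics, be aware that a poor or overly abstracted implementation can perform worse than standard C with an optimizing compiler. A good approach is to first write a plain C implementation of the algorithm as a baseline and for verification, and only then start optimizing by hand.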
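The baseline-then-optimize workflow could look roughly like the sketch below. The two XOR routines are hypothetical stand-ins for a reference cipher and its hand-optimized version, not PRIMATEs code. The point is only the pattern: run both on the same inputs, including awkward lengths, and compare the outputs before trusting any timing of the fast version.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for the plain C reference: XOR every byte with a key byte. */
static void xor_reference(uint8_t *buf, size_t len, uint8_t key)
{
    for (size_t i = 0; i < len; i++)
        buf[i] ^= key;
}

/* Stand-in for a hand-optimized version: 8 bytes per step, then a byte tail. */
static void xor_wide(uint8_t *buf, size_t len, uint8_t key)
{
    uint64_t k = 0x0101010101010101ULL * key;  /* key byte repeated 8 times */
    size_t i = 0;
    for (; i + 8 <= len; i += 8) {
        uint64_t w;
        memcpy(&w, buf + i, 8);   /* memcpy avoids unaligned-access problems */
        w ^= k;
        memcpy(buf + i, &w, 8);
    }
    for (; i < len; i++)
        buf[i] ^= key;
}

/* Returns 1 if both versions agree for every length from 0 to max_len,
   0 otherwise (or if max_len exceeds the 256-byte test buffers). */
static int outputs_match(size_t max_len, uint8_t key)
{
    uint8_t a[256], b[256];
    if (max_len > sizeof a)
        return 0;
    for (size_t len = 0; len <= max_len; len++) {
        for (size_t i = 0; i < len; i++)
            a[i] = b[i] = (uint8_t)(i * 31u + len);  /* deterministic test data */
        xor_reference(a, len, key);
        xor_wide(b, len, key);
        if (memcmp(a, b, len) != 0)
            return 0;
    }
    return 1;
}
```

Once the outputs match, timing both versions with the same harness shows whether the hand optimization actually beats what the compiler produces from the plain C code.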

Where are the encrypt/decrypt functions?
