Generic way to evaluate a function cost at compile time

Question

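I am currently working on an implementation of multidimensional array iterators. Consider iterating over two contiguous ranges that hold compatible data with different layouts (row-major versus column-major in 2D), for example for std::equal or std::copy. Among the stride orders given by each iterator, I would like to find the one with the fastest execution time.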

For example:

row of vector components = A -> m elements
row of vectors =           B -> n elements
2D plane of vectors =      C -> 3 elements
row of plane of vectors =  D -> 10 elements

given the data ordered by ascending strides:
first array:   B | A | C | D
second array:  B | A | D | C

 Obviously, we can iterate over both iterators in chunks of m*n elements. Then:

 If we choose the first array convention, the first iterator is contiguous and the second one 
 will perform (3 - 1)*(10 - 1) jumps forward with a stride of 10 and (10 - 1) jumps backward.

 If we choose the second array convention, the second iterator is contiguous and the first one will 
 perform (10 - 1)*(3 - 1) jumps forward with a stride of 3 and (3 - 1) jumps backward.

=> The second convention is better at everything in this example.
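The kind of jump counting done by hand above can be sketched as a constexpr simulation. This is a minimal sketch, not the asker's iterator code: `JumpCount` and `count_jumps` are hypothetical names, offsets are measured in blocks of m*n elements, and the counting convention (every non-contiguous move to a higher offset is a forward jump, every move to a lower offset is a backward jump) is an assumption that need not match the hand estimate above. Such a count is only a static proxy and says nothing by itself about execution time.

```cpp
#include <cstddef>

// Hypothetical helper: counts the jumps of a "follower" iterator when a
// "leader" traversal dictates the visiting order. Offsets are in blocks.
struct JumpCount {
    std::size_t forward = 0;   // non-contiguous moves to a higher offset
    std::size_t backward = 0;  // moves to a lower offset
};

// The leader visits blocks with the `inner` dimension varying fastest and the
// `outer` dimension slowest. The follower stores the same blocks with the two
// dimensions swapped, so its block offset is outer_index + outer * inner_index.
constexpr JumpCount count_jumps(std::size_t inner, std::size_t outer) {
    JumpCount jc{};
    std::size_t prev = 0;
    bool first = true;
    for (std::size_t o = 0; o < outer; ++o) {
        for (std::size_t i = 0; i < inner; ++i) {
            const std::size_t cur = o + outer * i;  // follower offset
            if (!first) {
                if (cur < prev) {
                    ++jc.backward;
                } else if (cur != prev + 1) {
                    ++jc.forward;
                }
            }
            prev = cur;
            first = false;
        }
    }
    return jc;
}

// First array leads (C = 3 is the inner dimension, D = 10 the outer one).
constexpr JumpCount first_leads = count_jumps(3, 10);
// Second array leads (D = 10 is the inner dimension, C = 3 the outer one).
constexpr JumpCount second_leads = count_jumps(10, 3);
```

Both results are available to `static_assert` or as template arguments, so a heuristic built on them could select a traversal order per instantiation. Whether that heuristic matches real timings still has to be measured.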

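Since I have to take many factors into account, such as back-and-forth memory jumps, contiguity, and the iterator implementation itself (which is not trivial), I would like to run a designed set of experiments. However, I also know everything at compile time (sizes and strides), so it would be nice to run those experiments at compile time for each template instantiation. My question is: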

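Is it possible to evaluate the runtime cost of some instructions at compile time, given that everything except the memory addresses of the input arrays is known at compile time?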

Answer 1

No. Your question is based on wrong assumptions.

Some of the wrong assumptions (there may be others):

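The function is used as-is: the compiler may inline it in many places, or decide that it is better kept as a separate function because inlining would increase code size. Since the generated code can differ slightly depending on the surrounding code, you may see different performance.
An instruction has a cost: processors often execute instructions out of order or run several of them in parallel. Something that can take a long time (like a division) may be hidden if it is surrounded by other memory accesses, so that its cost is amortized.
Performance is independent of the processor: the compiler does not know which specific processor you will run on, how large the caches or cache lines are, how fast main memory is, or how good or bad branch prediction will be. All of these have a huge impact on performance.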

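What you can do is profile and measure. Profile the application that uses this function to see whether you really need to fix it. Measure the performance you get and try different options.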
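As an illustration of the "measure and try different options" advice, here is a minimal timing sketch. It is not the asker's iterator code: all names (`sum_storage_order`, `sum_strided_order`, `best_time_ms`, `compare_orders`) are hypothetical, a plain row-major 2D array stands in for the multidimensional case, and the repeat count of 5 is an arbitrary assumption. It times the same reduction in two traversal orders with `std::chrono::steady_clock`.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Visit a row-major rows x cols array in storage order (unit stride).
long long sum_storage_order(const std::vector<int>& a, std::size_t rows, std::size_t cols) {
    long long s = 0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            s += a[r * cols + c];
    return s;
}

// Visit the same array column by column (stride of `cols` elements).
long long sum_strided_order(const std::vector<int>& a, std::size_t rows, std::size_t cols) {
    long long s = 0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            s += a[r * cols + c];
    return s;
}

// Best-of-N wall-clock time of f() in milliseconds; the value returned by
// the last call to f() is stored in `result`.
template <class F>
double best_time_ms(F f, int repeats, long long& result) {
    double best = -1.0;
    for (int i = 0; i < repeats; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        result = f();
        const auto t1 = std::chrono::steady_clock::now();
        const double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        if (best < 0.0 || ms < best) best = ms;
    }
    return best;
}

struct Comparison {
    double storage_ms;  // best time of the unit-stride traversal
    double strided_ms;  // best time of the strided traversal
};

Comparison compare_orders(std::size_t rows, std::size_t cols) {
    std::vector<int> a(rows * cols);
    for (std::size_t i = 0; i < a.size(); ++i) a[i] = static_cast<int>(i % 7);

    long long s1 = 0, s2 = 0;
    const double t1 = best_time_ms([&] { return sum_storage_order(a, rows, cols); }, 5, s1);
    const double t2 = best_time_ms([&] { return sum_strided_order(a, rows, cols); }, 5, s2);
    assert(s1 == s2);  // both orders must visit exactly the same elements
    return {t1, t2};
}
```

Calling something like `compare_orders(4096, 4096)` on the target machine gives two numbers to compare. The result depends on the processor, the cache sizes, and the compiler flags, and a serious benchmark would also guard against the optimizer removing or moving the measured work, for example by using a dedicated benchmarking library.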


c++

arrays

optimization

iterator

constexpr