为什么遍历比合并两个排序的 std::list 更耗时？

Question

令我惊讶的是，遍历比合并两个已排序的 std::list 花费的时间多 12% 左右。由于合并可以被认为和实现为连续的元素比较、列表拼接和迭代器遍历两个分离的排序链表。因此，遍历不应比合并它们慢，尤其是当两个列表足够大时，因为迭代元素的比例正在增加。

不过，结果好像和我想的不太一样，我是这样验证上面的想法的：

std::list<int> list1, list2;

for (int cnt = 0; cnt < 1 << 22; cnt++)
    list1.push_back(rand());
for (int cnt = 0; cnt < 1 << 23; cnt++)
    list2.push_back(rand());

list1.sort();
list2.sort();

auto start = std::chrono::system_clock::now();  // C++ wall clock

// Choose either one option below
list1.merge(list2);         // Option 1
for (auto num : list1);     // Option 2
for (auto num : list2);     // Option 2

std::chrono::duration<double> diff = std::chrono::system_clock::now() - start;
std::cout << std::setprecision(9) << "\n       "
          << diff.count() << " seconds (measured)" << std::endl;  // show elapsed time

PS。 icc 足够聪明，可以排除选项 2。尝试 sum += num; 并打印出 sum。

这是perf的输出：（不使用perf测量的时间保持不变）

选项 1：合并

       0.904575206 seconds (measured)

 Performance counter stats for './option-1-merge':

    33,395,981,671      cpu-cycles
       149,371,004      cache-misses              #   49.807 % of all cacherefs
       299,898,436      cache-references
    24,254,303,068      cycle-activity.stalls-ldm-pending    

       7.678166480 seconds time elapsed

选项 2：遍历

       1.01401903 seconds (measured)

 Performance counter stats for './option-2-traverse':

    33,844,645,296      cpu-cycles
       138,723,898      cache-misses             #   48.714 % of all cacherefs
       284,770,796      cache-references
    25,141,751,107      cycle-activity.stalls-ldm-pending

       7.806018949 seconds time elapsed

由于这些链表上的属性可怕的空间局部性。缓存未命中是导致CPU 停顿的主要原因，并且占用了大部分CPU 资源。奇怪的是，选项 2 的缓存未命中率低于选项 1，但它有更多的 CPU 停顿和 CPU 周期来完成其任务。是什么导致了这种异常？

Answer 1

如你所知，占用你所有时间的是记忆。

缓存未命中很糟糕，但停顿也很糟糕。

来自this paper：

Applications with irregular memory access patterns, e.g.,dereferencing chains of pointers when traversing linked lists or trees, may not generate enough concurrently outstanding requests to fully utilize the data paths. Nevertheless, such applications are clearly limited by the performance of memory accesses as well. Therefore, considering the bandwidth utilization is not sufficient to detect all memory related performance issues.

基本上，随机游走指针无法使内存带宽饱和。

通过等待下一个指针的加载位置，每次迭代都会阻塞每个循环。如果它不在缓存中，cpu 什么也做不了——它停止了。

组合紧密 loop/merge 尝试将两个页面加载到缓存中。当一个加载时，有时 cpu 可以在另一个上前进。

您测量的结果是合并比裸浪费的双迭代有更少的停顿。

或者换句话说，

24,254,303,068      cycle-activity.stalls-ldm-pending

是一个很大的数字，小于：

25,141,751,107      cycle-activity.stalls-ldm-pending

我很惊讶这足以产生 10% 的差异，但这就是性能与衡量有关的原因。

为什么遍历比合并两个排序的 std::list 更耗时？

Why is traversing more time-consuming than merging on two sorted std::list?

c++

sorting

merge

traversal