简单 mapreduce 问题的高 GC 时间

Question

我有一个用 Julia 编写的模拟程序，它作为主循环的一部分执行与此等效的操作：

# Some fake data
M = [randn(100,100) for m=1:100, n=1:100]
W = randn(100,100)
work = zip(W,M)
result = mapreduce(x -> x[1]*x[2], +,work)

换句话说，加权矩阵的简单求和。对上述代码进行计时会产生

0.691084 seconds (79.03 k allocations: 1.493 GiB, 70.59% gc time, 2.79% compilation time)

我对大量内存分配感到惊讶，因为这个问题应该可以就地解决。为了查看是否是我对 mapreduce 的使用有误，我还测试了以下等效实现：

@time begin
    res = zeros(100,100)
    for m=1:100
        for n=1:100
            res += W[m,n] * M[m,n]
        end
    end
end

这给了

0.442521 seconds (50.00 k allocations: 1.491 GiB, 70.81% gc time)

因此，如果我用 C++ 或 Fortran 编写此代码，就地完成所有这些操作会很简单。这在 Julia 是不可能的吗？或者我在这里遗漏了什么...？

Answer 1

可以这样原地做:

function ws(W, M)
    res = zeros(100,100)
    for m=1:100
        for n=1:100
            @. res += W[m,n] * M[m, n]
        end
    end
    return res
end

时间是：

julia> @time ws(W, M);
  0.100328 seconds (2 allocations: 78.172 KiB)

请注意，为了就地执行此操作，我使用了广播（我也可以使用循环，但它是一样的）。

您的代码的问题在于：

res += W[m,n] * M[m,n]

你得到两个分配：

当你做乘法时W[m,n] * M[m,n]分配一个新的矩阵。
当你做加法时res += ... 再次分配一个矩阵

通过将广播与 @. 一起使用，您可以执行就地操作，请参阅 https://docs.julialang.org/en/v1/manual/mathematical-operations/#man-dot-operators 了解更多说明。

另外请注意，我已将代码包装在一个函数中。如果你不这样做，那么访问 W 和 M 是类型不稳定的，这也会导致分配，参见 https://docs.julialang.org/en/v1/manual/performance-tips/#Avoid-global-variables.

Answer 2

我想对 Bogumił 的回答进行补充。缺少广播是主要问题，但除此之外，循环和 mapreduce 变体在基本语义方面有所不同。

mapreduce 的目的是通过与标识元素 init 的关联运算以未指定的顺序进行归约。这尤其还包括并行运行部分的（理论上的）选项，并且在突变方面效果不佳。来自文档：

The associativity of the reduction is implementation-dependent. Additionally, some implementations may reuse the return value of f for elements that appear multiple times in itr. Use mapfoldl or mapfoldr instead for guaranteed left or right associativity and invocation of f for every value.

和

It is unspecified whether init is used for non-empty collections.

循环变体真正对应的是折叠，它具有明确定义的顺序和初始（不一定是身份）元素，因此可以使用就地缩减运算符：

Like reduce, but with guaranteed left associativity. If provided, the keyword argument init will be used exactly once.

julia> @benchmark foldl((acc, (m, w)) -> (@. acc += m * w), $work; init=$(zero(W)))
BenchmarkTools.Trial: 45 samples with 1 evaluation.
 Range (min … max):  109.967 ms … 118.251 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     112.639 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   112.862 ms ±   1.154 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

                  ▄▃█ ▁▄▃                                        
  ▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄███▆███▄▁▄▁▁▄▁▁▄▁▁▁▁▁▄▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▁
  110 ms           Histogram: frequency by time          118 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark mapreduce(Base.splat(*), +, $work)
BenchmarkTools.Trial: 12 samples with 1 evaluation.
 Range (min … max):  403.100 ms … 458.882 ms  ┊ GC (min … max): 4.53% … 3.89%
 Time  (median):     445.058 ms               ┊ GC (median):    4.04%
 Time  (mean ± σ):   440.042 ms ±  16.792 ms  ┊ GC (mean ± σ):  4.21% ± 0.92%

  ▁           ▁                 ▁   ▁      ▁ ▁    ▁▁▁     █   ▁  
  █▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁█▁▁▁▁▁▁█▁█▁▁▁▁███▁▁▁▁▁█▁▁▁█ ▁
  403 ms           Histogram: frequency by time          459 ms <

 Memory estimate: 1.49 GiB, allocs estimate: 39998.

这样想：如果你将函数写成 parallel for loop 并减少 (+)，迭代也会有一个未指定的顺序，并且你会有必要的内存开销将单个结果复制到累积线程。

因此，存在一个权衡。在您的示例中， allocation/copying 占主导地位。在其他情况下，映射操作可能占主导地位，并行缩减（具有未指定的顺序，但复制开销）是值得的。

简单 mapreduce 问题的高 GC 时间

High GC time for simple mapreduce problem

performance

julia