Why is purrr::map2 so slow compared to base mapply?

Question

Consider this simple benchmark:

library(purrr)

list1 <- as.list(rep(1, 50))
list2 <- as.list(rep(1, 50))

microbenchmark::microbenchmark(
  map2(list1, list2, sum))
Unit: microseconds
                    expr    min       lq     mean   median      uq     max neval
 map2(list1, list2, sum) 375.31 384.2045 481.8708 407.8115 420.641 7923.58   100

microbenchmark::microbenchmark(
  mapply(sum, X = list1, Y = list2, SIMPLIFY = FALSE))
Unit: microseconds
                                                expr    min     lq     mean  median      uq    max neval
 mapply(sum, X = list1, Y = list2, SIMPLIFY = FALSE) 46.187 50.634 57.45634 53.3715 59.8715 127.27   100

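Why is purrr::map2 about 8 times slower than mapply? I mean, I am simply adding the numbers in the two lists pairwise.

The thing is, I use map2 in my current code, so I would like to understand where the overhead comes from (and how to work around it).

Thanks!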

Answer 1

As @eipi10 pointed out in the comments, some of the function-call overhead matters less when you work with larger amounts of data:

library(purrr)
library(microbenchmark)
list1 <- as.list(rep(1, 50000))
list2 <- as.list(rep(1, 50000))
microbenchmark(map2(list1, list2, sum), mapply(sum, X = list1, Y = list2, SIMPLIFY = FALSE))
Unit: milliseconds
                                                expr      min       lq     mean   median       uq      max neval cld
                             map2(list1, list2, sum) 73.84420 78.21917 82.53853 79.48526 81.28048 218.9266   100   b
 mapply(sum, X = list1, Y = list2, SIMPLIFY = FALSE) 51.92849 54.66514 61.34755 56.99206 58.67459 204.2119   100  a
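
One way to read the two benchmarks together is to divide each median by the number of elements. The sketch below only redoes that arithmetic with the medians reported above; the variable names are illustrative, and it assumes the medians are representative of a typical run.

```r
# Medians copied from the two benchmark outputs above, converted to microseconds.
n <- c(small = 50, large = 50000)
map2_med   <- c(small = 407.8115, large = 79485.26)
mapply_med <- c(small = 53.3715,  large = 56992.06)

# Relative slowdown of map2 at each input size.
ratio <- map2_med / mapply_med
print(round(ratio, 2))

# Median time per list element at each input size.
per_element <- rbind(map2 = map2_med / n, mapply = mapply_med / n)
print(round(per_element, 2))
```

Read this way, mapply costs about 1.1 microseconds per element at both sizes, while map2 drops from about 8.2 to about 1.6 microseconds per element. That pattern is consistent with a roughly fixed per-call cost in map2 that is amortized over longer inputs.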

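mapply uses .Internal, whereas purrr::map2 uses .Call, to reach the underlying C function that does the work. The two interfaces differ in how they operate, particularly in how arguments are evaluated and in how R locates the underlying code.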
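
As a quick check of which interface a function goes through, its body can be inspected from R. This is a minimal sketch: the base part relies only on base::mapply, while the purrr part is guarded because what it prints depends on the installed purrr version, and newer releases may pass through helper functions before reaching C.

```r
# mapply is an ordinary R closure whose body hands off to C via .Internal.
mapply_body <- deparse(body(base::mapply))
uses_internal <- any(grepl(".Internal(mapply(", mapply_body, fixed = TRUE))
print(uses_internal)

# Optional: show how the installed purrr version implements map2.
# The output is version dependent, so nothing is asserted about it here.
if (requireNamespace("purrr", quietly = TRUE)) {
  print(body(purrr::map2))
}
```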

The R help page for .Internal is rather cryptic:

.Internal performs a call to an internal code which is built in to the R interpreter.

Only true R wizards should even consider using this function, and only R developers can add to the list of internal functions.

However, the R Internals manual explains:

C code compiled into R at build time can be called directly in what are termed primitives or via the .Internal interface, which is very similar to the .External interface except in syntax. More precisely, R maintains a table of R function names and corresponding C functions to call, which by convention all start with ‘do_’ and return a SEXP. This table (R_FunTab in file src/main/names.c) also specifies how many arguments to a function are required or allowed, whether or not the arguments are to be evaluated before calling, and whether the function is ‘internal’ in the sense that it must be accessed via the .Internal interface, or directly accessible in which case it is printed in R as .Primitive.

and

A small number of primitives are specials rather than builtins, that is they are entered with unevaluated arguments. This is clearly necessary for the language constructs and the assignment operators, as well as for && and || which conditionally evaluate their second argument, and ~, .Internal, call, expression, missing, on.exit, quote and substitute which do not evaluate some of their arguments.
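
The distinction the manual draws between builtins, specials, and ordinary R functions can be seen from R with typeof() and is.primitive(). The sketch below uses only base R; the choice of example functions is mine, not the manual's.

```r
# typeof() distinguishes the three kinds of function objects.
kinds <- c(
  sum    = typeof(sum),          # builtin: arguments are evaluated before the C code runs
  quote  = typeof(quote),        # special: receives its argument unevaluated
  and    = typeof(`&&`),         # special: second operand is evaluated conditionally
  mapply = typeof(base::mapply)  # closure: R code that calls .Internal in its body
)
print(kinds)

# is.primitive() is TRUE for builtins and specials, FALSE for closures.
print(c(sum = is.primitive(sum), mapply = is.primitive(base::mapply)))
```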

The help file for .Call notes:

If one of these functions is to be used frequently, do specify PACKAGE (to confine the search to a single DLL) or pass .NAME as one of the native symbol objects. Searching for symbols can take a long time, especially when many namespaces are loaded.

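This means that, when .Call is used, some time can be spent searching the loaded DLLs for the function. Notably, purrr::map2 does not specify the package name when it calls .Call; doing so might reduce the overhead.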
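
To get a feel for how wide such a search could be in a given session, the loaded DLLs can be listed from R. This is only a rough indicator under the assumption that the symbol is looked up by name without PACKAGE; it does not measure the lookup cost itself.

```r
# Every loaded DLL is a candidate when a symbol is looked up by name without PACKAGE.
dlls <- getLoadedDLLs()
cat("Loaded DLLs:", length(dlls), "\n")

# The first few DLL names; the list grows as more namespaces are loaded.
print(head(names(dlls)))
```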
