Why is purrr::map2 so slow compared to base mapply?
Consider this simple benchmark:
library(purrr)
list1 <- as.list(rep(1, 50))
list2 <- as.list(rep(1, 50))
microbenchmark::microbenchmark(
  map2(list1, list2, sum))
Unit: microseconds
                    expr    min       lq     mean   median      uq     max neval
 map2(list1, list2, sum) 375.31 384.2045 481.8708 407.8115 420.641 7923.58   100
microbenchmark::microbenchmark(
  mapply(sum, X=list1, Y=list2, SIMPLIFY = FALSE))
Unit: microseconds
                                                expr    min     lq     mean  median      uq    max neval
 mapply(sum, X = list1, Y = list2, SIMPLIFY = FALSE) 46.187 50.634 57.45634 53.3715 59.8715 127.27   100
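Why is purrr::map2 about 8 times slower than mapply? All I am doing is adding the numbers in two lists pairwise.
The catch is that I use map2 in my current code, so I would like to understand where the overhead comes from (and how to work around it).
Thanks!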
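As @eipi10 pointed out in the comments, some of the function-call overhead becomes less significant when working with a lot of data. With 50,000 elements per list, the gap in median time shrinks from roughly 8x to roughly 1.4x: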
list1 <- as.list(rep(1, 50000))
list2 <- as.list(rep(1, 50000))
microbenchmark::microbenchmark(map2(list1, list2, sum), mapply(sum, X=list1, Y=list2, SIMPLIFY = FALSE))
Unit: milliseconds
                                                expr      min       lq     mean   median       uq      max neval cld
                             map2(list1, list2, sum) 73.84420 78.21917 82.53853 79.48526 81.28048 218.9266   100   b
 mapply(sum, X = list1, Y = list2, SIMPLIFY = FALSE) 51.92849 54.66514 61.34755 56.99206 58.67459 204.2119   100  a
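To see how the gap depends on input size, the two benchmarks above can be folded into one helper that reports the median cost per element. This is only a sketch: `per_element_us` is a hypothetical helper name, it assumes the purrr and microbenchmark packages are installed, and the numbers will differ by machine and package version.

```r
# Sketch: median time per list element (in microseconds) for map2 and mapply.
# Assumes purrr and microbenchmark are installed.
library(purrr)
library(microbenchmark)

# per_element_us is a hypothetical helper, not part of either package.
per_element_us <- function(n, times = 20) {
  list1 <- as.list(rep(1, n))
  list2 <- as.list(rep(1, n))
  bm <- microbenchmark(
    map2   = map2(list1, list2, sum),
    mapply = mapply(sum, list1, list2, SIMPLIFY = FALSE),
    times  = times
  )
  # microbenchmark records each timing in nanoseconds in bm$time
  med_ns <- sapply(split(bm$time, bm$expr), median)
  med_ns / n / 1000
}

# One row per input size; a shrinking ratio between the columns suggests
# that a fixed per-call cost is being spread over more elements.
print(rbind(n_50 = per_element_us(50), n_50000 = per_element_us(50000)))
```

If the per-element figures for the two functions move closer together as n grows, most of the difference at n = 50 is a fixed cost per call rather than a cost per element.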
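mapply uses .Internal, whereas purrr::map2 uses .Call, to reach the underlying C function that does the processing. The two interfaces work differently, particularly in how arguments are evaluated and in how R locates the underlying code.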
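One way to check this yourself is to look for the interface call in each function's source. A minimal sketch: `mentions` is a hypothetical helper, the mapply check needs only base R, and the purrr check is version-dependent, because some purrr releases appear to place the .Call in an internal helper rather than in map2() itself.

```r
# Does the deparsed body of a function mention a given piece of text?
# mentions() is a hypothetical helper written for this illustration.
mentions <- function(f, what) {
  any(grepl(what, deparse(body(f)), fixed = TRUE))
}

# base::mapply is an R closure whose body hands off to .Internal(mapply(...))
print(mentions(mapply, ".Internal("))

# Only run the purrr check if the package is available.
if (requireNamespace("purrr", quietly = TRUE)) {
  # May print FALSE on versions where map2() delegates to a helper
  # that contains the .Call.
  print(mentions(purrr::map2, ".Call("))
}
```

Printing `mapply` or `purrr::map2` at the console shows the same source in full.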
The R help page for .Internal is rather cryptic:
.Internal performs a call to an internal code which is built in to the
R interpreter.
Only true R wizards should even consider using this function, and only
R developers can add to the list of internal functions.
However, the R Internals manual explains:
C code compiled into R at build time can be called directly in what
are termed primitives or via the .Internal interface, which is very
similar to the .External interface except in syntax. More precisely, R
maintains a table of R function names and corresponding C functions to
call, which by convention all start with ‘do_’ and return a SEXP. This
table (R_FunTab in file src/main/names.c) also specifies how many
arguments to a function are required or allowed, whether or not the
arguments are to be evaluated before calling, and whether the function
is ‘internal’ in the sense that it must be accessed via the .Internal
interface, or directly accessible in which case it is printed in R as
.Primitive.
and
A small number of primitives are specials rather than builtins, that
is they are entered with unevaluated arguments. This is clearly
necessary for the language constructs and the assignment operators, as
well as for && and || which conditionally evaluate their second
argument, and ~, .Internal, call, expression, missing, on.exit, quote
and substitute which do not evaluate some of their arguments.
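The distinction the manual draws between builtins, specials and ordinary R functions is visible from R itself through typeof(). A small sketch using base R only:

```r
# Classify a few functions by how R enters them.
fns <- list(
  sum       = sum,        # primitive whose arguments are evaluated first
  quote     = quote,      # primitive entered with unevaluated arguments
  `&&`      = `&&`,       # evaluates its second argument conditionally
  .Internal = .Internal,  # listed in the manual among the specials
  mapply    = mapply      # ordinary R function that wraps .Internal
)

kinds <- vapply(fns, typeof, character(1))
print(kinds)

# is.primitive() separates the builtin/special group from closures.
print(vapply(fns, is.primitive, logical(1)))
```

Here `sum` reports "builtin", `quote`, `&&` and `.Internal` report "special", and `mapply` reports "closure".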
The help file for .Call notes:
If one of these functions is to be used frequently, do specify PACKAGE
(to confine the search to a single DLL) or pass .NAME as one of the
native symbol objects. Searching for symbols can take a long time,
especially when many namespaces are loaded.
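This means that, when .Call is used, some time can be spent searching the loaded DLLs for the function. It is worth noting that purrr::map2 does not specify the package name when it uses .Call; doing so might reduce the overhead.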
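To get a feel for what "searching the DLLs" covers, you can list the DLLs loaded in the current session. This is a rough sketch using base R only; it shows the size of the search space, not the actual lookup cost, and the routine and package names in the closing comments are hypothetical.

```r
# DLLs currently loaded into this R session. A .Call() that is given only a
# routine name as a string, with no PACKAGE, may have to search across these.
dlls_before <- names(getLoadedDLLs())
print(dlls_before)

# Loading another namespace that ships compiled code can add to the list.
# "tools" is used here only because it comes with every R installation.
invisible(loadNamespace("tools"))
dlls_after <- names(getLoadedDLLs())
print(setdiff(dlls_after, dlls_before))  # empty if tools was already loaded

# The two forms the help page recommends look like this
# (my_routine, my_routine_symbol and mypkg are hypothetical, so not run):
#   .Call("my_routine", x, PACKAGE = "mypkg")  # confine the search to one DLL
#   .Call(my_routine_symbol, x)                # pass a native symbol object
```

The more packages with compiled code are attached, the longer this list becomes, which is the situation the help page warns about.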