dplyr bind_rows execution time grows exponentially

Question

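I have a list of tibbles (length = 5000) that I want to merge. They all have the same columns, so I thought of merging them with dplyr::bind_rows. On the face of it, binding the rows of each added tibble is very fast, but the execution time increases exponentially, not linearly, as more tibbles are added. After some googling, this looks very much like the bug reported here: https://github.com/tidyverse/dplyr/issues/1396. Even though that bug should have been fixed in the bind_rows internals, I still see the elapsed time per added tibble increase exponentially.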

library(foreach)
library(tidyverse)
set.seed(123456)
tibbles <- foreach(i = 1:200) %do% {
              tibble(a = rnorm(10000), 
                     b = rep(letters[1:25], 400), 
                     c = rnorm(10000))
}
times <- foreach(i = 1:200) %do% {
            system.time(tibbles[1:i] %>% 
                            purrr::reduce(bind_rows))
}

times %>% 
     map_dbl(.f = ~.x[3]) %>% 
     plot(ylab = "time [s] per added tibble")

Any idea why this happens and how to fix it?

Thanks.

Answer 1

To expand on abhiieor's comment, I think rbindlist or rbind from data.table might help here. Assuming you are trying to bind the rows of a series of tibbles (or data.tables), this code is almost instantaneous.

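library(foreach)
library(data.table)

# Start the timer
time <- proc.time()

# Build 200 data.tables with the same columns as in the question
data_tables <- foreach(i = 1:200) %do% {
  data.table(a = rnorm(10000),
             b = rep(letters[1:25], 400),
             c = rnorm(10000))
}

# Bind all of them in a single call
all_tables <- rbindlist(data_tables)

# Elapsed time for creation plus binding
end_time <- proc.time() - time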
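
If the data already exists as a list of tibbles, as in the question, a possible variant is to pass that list straight to rbindlist and convert the result back. This is only a sketch: the list `tibbles_small` is a made-up stand-in for the real data, and it assumes the data.table and tibble packages are installed.

```r
library(data.table)
library(tibble)

# Hypothetical stand-in for the question's list of tibbles
tibbles_small <- lapply(1:3, function(i) {
  tibble(a = c(0.5, 1.5), b = c("x", "y"))
})

# rbindlist accepts a list of data frames; as_tibble converts the
# resulting data.table back to a tibble
combined <- as_tibble(rbindlist(tibbles_small))

dim(combined)
```

Converting back is optional; it only matters if later code relies on tibble behavior.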

Answer 2

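My guess is that every time rbind is called, R has to allocate a new set of columns and copy the data over. That would make the total time grow quadratically.

Try pre-allocating the columns: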
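
A back-of-the-envelope count makes the quadratic growth concrete. It assumes, as a simplification and not as a statement about dplyr internals, that each pairwise bind copies every row of the accumulated result plus the new tibble into a fresh object. The sizes are the ones from the question's example.

```r
n <- 10000  # rows per tibble, as in the question
k <- 200    # number of tibbles

# reduce(bind_rows): step i produces a new result of i * n rows (i = 2..k),
# so the rows written add up to n * (2 + 3 + ... + k)
rows_copied_reduce <- sum(as.numeric(2:k) * n)

# A single pass over the list writes each row into the result once
rows_copied_single <- as.numeric(k) * n

rows_copied_reduce / rows_copied_single
#> [1] 100.495
```

Under that assumption the pairwise approach moves roughly k / 2 times as much data, which is why the cost per added tibble keeps rising.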

system.time({
n <- vapply(tibbles, nrow, 0)
ntot <- sum(n)
cols <- list(a = numeric(ntot), b = character(ntot), c = numeric(ntot))

off <- 0
for (i in seq_along(tibbles)) {
    ix <- off + seq_len(n[[i]])
    for (j in seq_along(cols)) {
        cols[[j]][ix] <- tibbles[[i]][[j]]
    }
    off <- off + n[[i]]
}

result <- as_tibble(cols)
})
#>    user  system elapsed 
#>   0.073   0.012   0.085

Compare with the purrr::reduce approach:

system.time(tibbles[1:200] %>% purrr::reduce(bind_rows))
#>   user  system elapsed 
#>  4.888   2.013   6.928

However, as aosmith points out, in your case it is best to just call bind_rows on the whole list:

system.time(result <- bind_rows(tibbles))
#>  user  system elapsed 
#> 0.039   0.005   0.044
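
To check that the single call gives the same data as the pairwise reduce, a small comparison along these lines can be run. The list `small` is a made-up example, not the question's data.

```r
library(dplyr)
library(purrr)

# Hypothetical small input: five tibbles with identical columns
small <- lapply(1:5, function(i) tibble(id = i, x = c(1, 2, 3)))

by_reduce   <- reduce(small, bind_rows)  # pairwise, one bind per element
in_one_call <- bind_rows(small)          # whole list in a single call

# TRUE when both results hold the same columns and values
isTRUE(all.equal(by_reduce, in_one_call))
nrow(in_one_call)
```

Both calls should return the same 15 rows in the same order, so switching to the single call changes only the running time.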
