如何避免为大型数据集编写嵌套 for 循环？

Question

对于双变量问题，outer 很可能是最好的解决方案，如果要循环的 space 足够小，那么我们可以让 expand.grid 做我们的跑腿。但是，如果我们有两个以上的变量和一个大的 space 要循环，则这些被排除在外。 outer 无法处理两个以上的变量，并且 expand.grid 占用的内存比我见过的机器能够占用的还要多。

我最近发现自己在写这样的代码：

n<-1000
for(c in 1:n){
    for(b in 1:c){
        for(a in 1:b){
            if(foo(a,b,c))
            {
                bar(a,b,c)
            }
        }
    }
}

在这些情况下，嵌套循环似乎确实是自然的解决方案（例如，mapply 行不通，tapply 没有好的因素可供使用），但是是否有更好的解决方案方法？这似乎是通往错误代码的道路。

我怀疑 combn 可能会以某种方式做到这一点，但根据我的经验，它很快就会陷入与 expand.grid 相同的记忆陷阱。如果没记错的话，我也知道它会采取不明智的步骤，告诉我更改递归限制的全局设置。

Answer 1

我之前的函数 lazyExpandGrid 不是完美匹配，但我认为它解决了您对内存耗尽的担忧。其他语言有惰性迭代器的前景； R在iterators包里有，由于我对它不精通，前段时间写了this gist解痒

lazyExpandGrid 的一个问题是它期望因子是预先定义的。这可以通过快速条件处理，因此它可以节省内存，但不可否认 space 效率不高。我不认为在方法中实现条件是一个快速修复，因为它延迟处理扩展的机制是知道数学上哪个索引附加到哪个因素的组合 ...和条件会破坏它。

这里是该函数的工作方式：

n <- 3
it <- lazyExpandGrid(aa = 1:n, bb = 1:n, cc = 1:n)
while (length(thistask <- it$nextItem())) {
  if (with(thistask, bb > aa || cc > bb)) next
  print(jsonlite::toJSON(thistask))
}
# [{"aa":1,"bb":1,"cc":1}] 
# [{"aa":2,"bb":1,"cc":1}] 
# [{"aa":3,"bb":1,"cc":1}] 
# [{"aa":2,"bb":2,"cc":1}] 
# [{"aa":3,"bb":2,"cc":1}] 
# [{"aa":3,"bb":3,"cc":1}] 
# [{"aa":2,"bb":2,"cc":2}] 
# [{"aa":3,"bb":2,"cc":2}] 
# [{"aa":3,"bb":3,"cc":2}] 
# [{"aa":3,"bb":3,"cc":3}] 

### to demonstrate what an exhausted lazy-expansion looks like
it$nextItem()
# NULL
it$nextItem()
# NULL

（请注意带有 next 的条件如何跳过这些组合。）

这将转化为您的流程：

n <- 1000
it <- lazyExpandGrid(aa = 1:n, bb = 1:n, cc = 1:n)
it
# lazyExpandGrid: 4 factors, 1e+09 rows
#   $ index : 0

while (length(thistask <- it$nextItem())) {
  if (with(thistask, bb > aa || cc > bb)) next
  with(thistask, {
    if (foo(aa, bb, cc)) bar(aa, bb, cc)
  })
}

（或不使用 with，使用 thistask$aa，等等）

注意：我不想撒谎，不过，这简化了流程，但并没有使流程变快。在这种情况下，做某事 1e+09 次将花费 time，而且我不知道除了并行操作和友好的集群之外还有什么可以帮助的R主机。（我开始运行一个空的 no-op while 循环，花了 268 秒来完成其中的 822K。我希望你有足够的处理能力。）

Answer 2

这是重复的组合。 rcppalgos 可能是开箱即用的最佳选择，但在 n = 1000L 时，需要处理超过 5 亿种组合，这将占用约 2GB 的内存。

library(RcppAlgos)
n = 1000L
mat <- comboGeneral(n, 3L, repetition = TRUE)

现在有两条路可走。如果你有 RAM 并且你的函数可以被矢量化，你可以很快地完成上面的操作。比方说，如果组合的总和大于 1000，则您需要组合的均值，否则您需要组合的总和。

res <- if (rowSums(mat) > 1000L) 
  rowMeans(mat)
else
  rowSums(mat)

## Error: cannot allocate vector of size 1.2 Gb

哦不！我得到了可怕的分配向量错误。 rcppalgos 允许您 return 函数的结果。但是请注意，它 return 是一个列表并且速度要慢得多，因为它必须评估您的 R 函数而不是留在 C++ 中。正因如此，我改成了n = 100L因为我没有一整天...

comboGeneral(100L, 3L, repetition = TRUE,
                        FUN = function(x) { 
                          if (sum(x) > 100L)
                            mean(x)
                          else
                            sum(x)
                        }
)

如果我有一个静态集，我总是从 n 中选择 3 种组合，我可能会根据 foo(a,b,c) 和 bar(a,b,c) 直接使用 Rcpp 代码] 是，但首先我想了解更多有关功能的信息。

Answer 3

purrr .filter 的解决方案也有效：

library(purrr)

n <- 10L
levels <- 3L

# keep only elements below diagonal
isdesc<- function(...){all(diff(unlist(list(...)))<=0)}
# some extra filtering
foo <- function(...) { sum(unlist(list(...)))==27}

filter <- function(...) {!isdesc(...)|!foo(...)}

cross_list <- cross(rep(list(1L:n),levels),.filter = filter)

bar <- function(...) ( unlist(list(...))) 

cross_list %>% map(bar)

不幸的是，与 grid.expand 一样，它不能很好地扩展，因为 cross 在过滤之前首先分配完整的笛卡尔积。

Answer 4

重要的是要指出为什么使用 rcpp 既简单又推荐。

当我们参考 r, under the hood is a bunch of code compiled in c. Thus far, the R developers have not needed to develop compiled code to allow arbitrary functions foo() and bar() to be used in combinations with repetitions. So as users, we can use an r loop to have the flexibility of r 或者，当我们有很多迭代要循环时，看看一些替代方案。

R 循环很容易变成 Rcpp 循环。我已经包含了任意函数，所以我们可以 return 一些东西（如果 OP post 也包含一些东西到 return 就好了......）：

#include <Rcpp.h>
using namespace Rcpp;

bool foo(int x, int y, int z) {
  return((x + y + z) > 50);
}

int bar(int x, int y, int z) {
  return(x - y + z);
}

// [[Rcpp::export]]
double manual_combos_w_reps(const int n) {
  double ans = 0;

  for (int i = 1; i <= n; i++) {
    for (int j = 1; j <= i; j++) {
      for (int k = 1; k <= j; k++) {
        if (foo(i, j, k)) {
          ans += bar(i, j, k);
        }
      }
    }
  }

  return(ans);
}

这是 R 中的对应部分，它只是添加了 foo(...) 和 bar(...) 的代码。

r_foo = function(x, y, z) {
  return((x + y + z) > 50L)
}

r_bar = function (x, y, z) {
  return(x - y + z)
}

r_loop = function (n) {
  ans = 0;
  for (i in 1:n) {
    for (j in 1:i) {
      for (k in 1:j) {
        if (r_foo(i, j, k)) {
          ans = ans + r_bar(i, j, k)
        }
      }
    }
  }
  return(ans)
}

现在这是神奇的部分。 Rcpp 飞过这些迭代。对于 n = 1000L，R 代码需要 360 秒才能达到运行。运行.

只需要 0.5 秒的 Rcpp

n = 10L
bench::mark(manual_combos_w_reps(n)
            , r_loop(n)
            )

### A tibble: 2 x 13
##  expression                 min median `itr/sec` mem_alloc
##  <bch:expr>              <bch:> <bch:>     <dbl> <bch:byt>
##1 manual_combos_w_reps(n)  4.7us    5us   178048.    2.48KB
##2 r_loop(n)               1.63ms 1.68ms      505.        0B

n = 100L

### A tibble: 2 x 13
##  expression                min median `itr/sec` mem_alloc
##  <bch:expr>              <bch> <bch:>     <dbl> <bch:byt>
##1 manual_combos_w_reps(n) 467us  469us   2064.      2.48KB
##2 r_loop(n)               627ms  627ms      1.60        0B

n = 1000L

### A tibble: 2 x 13
##  expression                   min   median `itr/sec`
##  <bch:expr>              <bch:tm> <bch:tm>     <dbl>
##1 manual_combos_w_reps(n) 459.29ms 459.39ms   2.18   
##2 r_loop(n)                  6.04m    6.04m   0.00276

您绝对应该为此研究 rcpp - 基础 R 中没有真正规范的答案可以为您正在执行的任务提供高性能。唯一关心的是您实际的 foo() 和 bar() 函数是什么，因为它们可能难以在 rcpp 中实现。

如何避免为大型数据集编写嵌套 for 循环？

How can I avoid writing nested for loops for large data sets?

r

nested-loops