Optimize performance of a formula spanning three consecutive indices, with wraparound

Question

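I would like to optimize the implementation of this formula.

Here is the formula:

re[i] = x[i]^2 - x[i-1]*x[i+1]

x is an array of values. i goes from 1 to N, where N > 2400000. For i = 1, i-1 is the last element, and for i = lastElement, i+1 is the first element. Here is the code I have written: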

   x <- 1:2400000
   re <- array(data=NA, dim = NROW(x))
   lastIndex = NROW(x)
   for(i in 1:lastIndex){
      if (i==1) {
        re[i] = x[i]*x[i] - x[lastIndex]*x[i+1]
      } else if(i==lastIndex) {
        re[i] = x[i]*x[i] - x[i-1]*x[1]
      } else {
        re[i] = x[i]*x[i] - x[i-1]*x[i+1]  
      }
    }

Can this be done with apply in R?

Answer 1

An lapply implementation of the formula would look like this:

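x <- as.numeric(1:2400000)  # doubles: integer x[i]*x[i] would overflow to NA
last <- length(x)

# iterate over positions, not over the values of x
re <- lapply(seq_along(x), function(i) {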
    if(i == 1) {
        x[i]*x[i] - x[last]*x[i+1]
    } else if (i == last) {
        x[i]*x[i] - x[i-1]*x[1]
    } else {
        x[i]*x[i] - x[i-1]*x[i+1]  
    }
}) 

re <- unlist(re)

lapply returns a list, so unlist() is used to convert the result to a vector.
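
As a variation on the above (not part of the original answer), vapply can stand in for lapply plus unlist(), since it returns a numeric vector directly and checks that every result is a single number. The sketch below uses a small illustrative vector stored as doubles; the helper names prev and nxt are made up here.

```r
x <- as.numeric(1:10)   # small illustrative data, stored as doubles
n <- length(x)

re <- vapply(seq_along(x), function(i) {
  prev <- if (i == 1) n else i - 1   # index before i, wrapping to the end
  nxt  <- if (i == n) 1 else i + 1   # index after i, wrapping to the start
  x[i] * x[i] - x[prev] * x[nxt]
}, numeric(1))

print(re)
```

This is still an R-level loop over every index, so it is a tidier form of the same approach rather than a faster one.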

Answer 2

1) You can avoid all the special-casing in the computation by padding the start and end of x with copies of its last and first elements, like so:

N <- NROW(x)
x <- c(x[N], x, x[1]) # pad start and end to give wraparound; x now has N+2 elements

re <- lapply(2:(N+1), function(i) { x[i]*x[i] - x[i-1]*x[i+1] } )
#re <- unlist(re) as andbov wrote

# and remember not to use all of x, just x[2:(N+1)], elsewhere

2) Vectorize directly, as in @Dason's answer:

# Do the padding trick on x, then (parentheses matter: 1:N-1 parses as (1:N)-1)
x[2:(N+1)]^2 - x[1:N]*x[3:(N+2)]

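3) If performance matters, I suspect that using data.table, or else a for loop over i, will be faster, since the formula references three consecutive rows.

4) For better performance,

5) If you need even more speed, use the Rcpp extension (C++ under the hood)

See the questions I cited for good examples of using lineprof and microbenchmark to find where the bottleneck lies.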
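
The padding trick from items 1) and 2) can be sketched on a small vector as follows. This is only an illustration: the values are arbitrary, and the name xp for the padded copy is made up here so that the original x stays untouched. The last two lines show why the index ranges need parentheses.

```r
x <- c(3, 1, 4, 1, 5, 9)   # arbitrary illustrative values
N <- length(x)
xp <- c(x[N], x, x[1])     # padded copy: last element in front, first at the back

# x[i] sits at xp[i + 1], so the three aligned slices are:
re <- xp[2:(N + 1)]^2 - xp[1:N] * xp[3:(N + 2)]
print(re)

# Precedence pitfall: `:` binds tighter than binary + and -
shifted_range  <- 1:N - 1     # (1:N) - 1, i.e. 0 1 2 3 4 5
intended_range <- 1:(N - 1)   # 1 2 3 4 5
```

Keeping the padded data in a separate variable avoids having to remember to slice x everywhere else.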

Answer 3

We can use direct vectorization for this:

# Make fake data
x <- 1:10
n <- length(x)
# create vectors for the plus/minus indices
xminus1 <- c(x[n], x[-n])
xplus1 <- c(x[-1], x[1])

# Use direct vectorization to get re
re <- x^2 - xminus1*xplus1
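
One caveat when applying this to the data from the question: 1:2400000 is an integer vector, and in R an integer product that exceeds .Machine$integer.max becomes NA with a warning. x^2 returns a double, but xminus1*xplus1 stays integer. A minimal sketch of one workaround, assuming it is acceptable to convert x to double first:

```r
# Integer products overflow to NA (with a warning) past .Machine$integer.max
overflowed <- suppressWarnings(46341L * 46341L)
is.na(overflowed)            # TRUE

x <- as.numeric(1:2400000)   # convert to double before multiplying
n <- length(x)
xminus1 <- c(x[n], x[-n])
xplus1  <- c(x[-1], x[1])
re <- x^2 - xminus1 * xplus1

anyNA(re)                    # FALSE
```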

Answer 4

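If every x[i] really is equal to i, then you can do a little math:
x[i]^2 - x[i-1]*x[i+1] = i^2 - (i-1)*(i+1) = 1
So every element of the result is 1, except the first and the last, where the wraparound applies.
The result is: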

c(1-2*N, rep(1, N-2), N*N-(N-1))

In the general case (arbitrary values in x) you can do this (as in Dason's answer):

x*x - c(x[N], x[-N])*c(x[-1], x[1])
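
The two end values come from the wraparound: the first element is x[1]^2 - x[N]*x[2] = 1 - 2*N, and the last is x[N]^2 - x[N-1]*x[1] = N*N - (N-1). As a quick sanity check (an addition here, with a small N chosen only for illustration), the closed form and the general expression agree when x = 1:N:

```r
N <- 1000
x <- as.numeric(1:N)

closed_form <- c(1 - 2*N, rep(1, N - 2), N*N - (N - 1))
general     <- x*x - c(x[N], x[-N]) * c(x[-1], x[1])

all.equal(closed_form, general)   # TRUE
```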

Here is a solution using rollapply() from zoo:

library("zoo")
rollapply(c(x[length(x)],x, x[1]), width=3, function(x) x[2]^2 - x[1]*x[3]) # or:
rollapply(c(tail(x,1), x, x[1]), width=3, function(x) x[2]^2 - x[1]*x[3])

Here is the benchmark:

library("microbenchmark")
library("zoo")

N <- 10000
x <- 1:N

microbenchmark(
  math=c(1-2*N, rep(1, N-2), N*N-(N-1)), # for the data from the question
  vect.i=x*x - c(x[N], x[-N])*c(x[-1], x[1]), # general data
  roll.i=rollapply(c(x[length(x)],x, x[1]), width=3, function(x) x[2]^2 - x[1]*x[3]), # or:
  roll.tail=rollapply(c(tail(x,1), x, x[1]), width=3, function(x) x[2]^2 - x[1]*x[3])
)
# Unit: microseconds
#      expr       min         lq        mean     median         uq        max neval cld
#      math    33.613    34.4950    76.18809    36.9130    38.0355   2002.152   100  a 
#    vect.i   188.928   192.5315   732.50725   197.1955   198.5245  51649.652   100  a 
#    roll.i 56748.920 62217.2550 67666.66315 68195.5085 71214.9785 109195.049   100   b
# roll.tail 57661.835 63855.7060 68815.91001 67315.5425 71339.6045 119428.718   100   b
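
The benchmark leaves out the for loop from the question. If microbenchmark is not installed, a rough comparison is possible with base R's system.time. This is only a sketch: the function names loop_version and vect_version are made up for illustration, and timings vary by machine, so none are shown.

```r
loop_version <- function(x) {
  n <- length(x)
  re <- numeric(n)
  for (i in seq_len(n)) {
    prev <- if (i == 1) n else i - 1   # wrap to the last element
    nxt  <- if (i == n) 1 else i + 1   # wrap to the first element
    re[i] <- x[i] * x[i] - x[prev] * x[nxt]
  }
  re
}

vect_version <- function(x) {
  n <- length(x)
  x * x - c(x[n], x[-n]) * c(x[-1], x[1])
}

x <- as.numeric(1:100000)
t_loop <- system.time(re_loop <- loop_version(x))
t_vect <- system.time(re_vect <- vect_version(x))

# elapsed wall-clock seconds for each version
print(c(loop = t_loop[["elapsed"]], vectorized = t_vect[["elapsed"]]))
all.equal(re_loop, re_vect)   # both versions should agree
```

A single system.time call is much noisier than the repeated runs microbenchmark performs, so treat the numbers as a rough indication only.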
