R access row index in apply function

Question

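I have a large dataset in memory, with roughly 400,000 rows. Working on a subset of this data frame, I want to generate a large image and set elements of that image to specific values based on the entries in the data frame. I have done this very simply, and no doubt naively, with a for loop: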

library('Matrix')

#saveMe is a subset of the dataframe containing the x-ranges I want 
#in columns 1,2; y-ranges in 3-4, and values in 5. 
saveMe<-structure(list(XMin = c(1, 17, 19, 19, 21, 29, 29, 31, 31, 31, 31, 33, 33, 35, 37, 39, 39, 39, 41, 43), XMax = c(9, 15, 1, 3,1, 17, 37, 5, 13, 25, 35, 17, 43, 23, 47, 25, 25, 33, 21, 29), YMin = c(225, 305, 435, 481, 209, 1591, 157, 115, 1, 691, 79, 47, 893, 1805, 809, 949, 2179, 1733, 339, 739), YMax = c(277,315, 435, 499, 213, 1689, 217, 133, 1, 707, 111, 33, 903,1827, 849, 973, 2225, 1723, 341, 765), Value = c(3, 1, 0,1, 1, 4, 3, 1, 1, 0, 2, 1, 1, 0, 2, 1, 1, 2, 0, 0)), .Names = c("XMin", "XMax", "YMin", "YMax", "Value"),class = c("data.table", "data.frame"), row.names = c(NA, -20L))

#Create sparse matrix to store the result:
xMax <- max(saveMe$XMax) - min(saveMe$XMin)+1
yMax <- max(saveMe$YMax) - min(saveMe$YMin)+1
img<-Matrix(0, nrow = xMax, ncol = yMax, sparse = TRUE)

for (kx in 1:nrow(saveMe)) {
  img[as.numeric(saveMe[kx,1]):as.numeric(saveMe[kx,2]), as.numeric(saveMe[kx,3]):as.numeric(saveMe[kx,4])] <- as.numeric(saveMe[kx,5])
}
nnzero(img)
image(img)

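This takes a really long time (about five hours) and it is a naive approach, iterating row by row. I know that apply can often be used to speed this kind of thing up considerably. So, as you would expect, I have tried that: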

img<-Matrix(0, nrow = xMax, ncol = yMax, sparse = TRUE)
apFun <- function(x, imToUse){
  #idea is to then change that to something like...
  imToUse[(x[1]:x[2]), (x[3]:x[4]) ] <- x[5]
}  

apply(as.matrix(saveMe), 1, apFun,imToUse=img);
nnzero(img)
image(img)

However, no matter what I try, the resulting elements of img are always zero. I think this may be a variable scoping issue. What am I doing wrong?

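By the way, the problem I really want to solve is to create an integer "sparse image" for this data that is zero everywhere except for the rectangles bounded by [XMin XMax YMin YMax], which should equal value (i.e. x[5]). Is there a better way to do this?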

Answer 1

Your suspicion is correct. Try this to convince yourself:

f <- function(x){
    x <- 5
}

x <- 4

f(x)
# Nothing is printed (the value 5 is returned invisibly)
x 
# [1] 4

y <- f(x)
x
# [1] 4
y
# [1] 5

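In your function, the assignment only changes a local copy of imToUse, and you are not assigning the result of apply() to anything. At a minimum, you would want to add the updated object as the last expression so that it becomes the return value.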

apFun <- function(x, imToUse){
  #idea is to then change that to something like...
  imToUse[(x[1]:x[2]), (x[3]:x[4]) ] <- x[5]
  imToUse
}

This is analogous to

rm(x, y)
f <- function(x){
    x <- 5
    x
}
x <- 4
f(x)
# [1] 5
x
# [1] 4

Note that you still have not updated x. But you are now returning the computed value.
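The same thing happens inside apply, once per row. A minimal sketch, assuming a small dense base matrix as a stand-in for the sparse image (the names set_cell and instr are hypothetical, chosen for illustration only):

```r
# Small dense stand-in for the sparse image (hypothetical 3 x 3 example).
img <- matrix(0, nrow = 3, ncol = 3)

# Same shape as apFun: modify the argument, then return it.
set_cell <- function(x, imToUse) {
  imToUse[x[1], x[2]] <- x[3]
  imToUse
}

# Two instruction rows of the form (row, col, value).
instr <- rbind(c(1, 1, 7), c(2, 2, 9))

# Every call starts from the ORIGINAL img and returns its own modified copy.
res <- apply(instr, 1, set_cell, imToUse = img)

sum(img != 0)      # img itself is untouched
dim(res)           # one flattened 3 x 3 copy per instruction row
colSums(res != 0)  # each copy carries only its own single update
```

So even with the return value added, apply hands back one separately modified copy per row rather than a single image with all updates accumulated.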

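Edit: On reviewing the purpose of your function and your call to apply, I suggest you stick with your original for loop. The purpose of your apply call is to update the value of an object in the parent environment. The benefits of apply are the convenience of a loop wrapper and the protection of a local environment, so in this case you would have to go through a series of contortions to break out of that protected wrapper.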
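For illustration only, one such contortion is the superassignment operator <<-, which assigns into the enclosing environment instead of a local copy. This is a sketch on a small dense base matrix (instr is a hypothetical stand-in for saveMe), not a recommendation; it is still a row-by-row loop in disguise and should not be expected to run faster than the for loop.

```r
# Small dense stand-in for the sparse image (hypothetical 3 x 3 example).
img <- matrix(0, nrow = 3, ncol = 3)

# Two instruction rows of the form (row, col, value).
instr <- rbind(c(1, 1, 7), c(2, 2, 9))

# `<<-` looks up img in the enclosing environment and modifies it there,
# so the updates survive after each call returns.
invisible(apply(instr, 1, function(x) {
  img[x[1], x[2]] <<- x[3]
}))

img  # now holds both updates
```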

How to speed it up: change your for loop to this

for (i in seq_len(nrow(saveMe))){
  img[saveMe[[i,1]]:saveMe[[i,2]], saveMe[[i,3]]:saveMe[[i,4]]] <- saveMe[[i,5]]
}

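How does this save you time? The big saving here comes from using [[ to extract a single value from the data table by index, rather than using [. Here are the numbers: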

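For each of the 400,000 rows of the data table you are looking up 5 single values by integer row and column index (so 2,000,000 lookups in the loop), and you are assigning into an array based on those values 400,000 times. The assignment may be hard to optimize, but the lookup is not. Let's run 100 trials of an integer-index lookup in a data table followed by assignment of the single value, comparing the [ and [[ operators.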

library(data.table)

DT <- data.table(x = sample(5000))
single <- replicate(100, {
  system.time({
    for (i in seq_len(nrow(DT))){
      z <- DT[i,1]
    }
  })
})  
double <- replicate(100, {
  system.time({
    for (i in seq_len(nrow(DT))){
      z <- DT[[i,1]]
    }
  })
})

rowMeans(single)
# user.self   sys.self    elapsed user.child  sys.child 
#   1.69405    0.03519    1.89836    0.00000    0.00000 
rowMeans(double)
# user.self   sys.self    elapsed user.child  sys.child 
#   0.05047    0.00083    0.05668    0.00000    0.00000

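The key value here is user.self. You can see that, based on 100 trials, extracting the value with [[ is roughly 30 times faster (about 1.69 s versus 0.05 s per pass over the 5,000 rows).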
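A further variation, which is not benchmarked above and is offered only as an assumption-laden sketch: pull each column out as a plain vector once, before the loop, so that the loop body indexes only atomic vectors and never touches the data table. The example uses a hypothetical two-row stand-in (saveSmall) and a dense base matrix instead of the sparse Matrix.

```r
# Hypothetical two-row stand-in for saveMe, with the same column layout.
saveSmall <- data.frame(
  XMin = c(1, 2), XMax = c(2, 3),
  YMin = c(1, 3), YMax = c(2, 4),
  Value = c(3, 1)
)

# Extract each column once; these are plain numeric vectors.
xmin <- saveSmall$XMin
xmax <- saveSmall$XMax
ymin <- saveSmall$YMin
ymax <- saveSmall$YMax
val  <- saveSmall$Value

# Dense stand-in for the image, sized to cover both rectangles.
img <- matrix(0, nrow = max(xmax), ncol = max(ymax))

# The loop now does only vector lookups plus the rectangle assignment.
for (i in seq_along(val)) {
  img[xmin[i]:xmax[i], ymin[i]:ymax[i]] <- val[i]
}

sum(img != 0)  # number of filled cells
```

Whether this beats the [[ version on the real 400,000-row data would need to be measured; the rectangle assignment into the sparse matrix may still dominate the run time.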
