Vectorizing a nested loop in R over a 3-dimensional array

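I am using a nested loop in R to build a new data frame from a large three-dimensional array. I have tried running the code, but the job fails after about 48 hours. The current nested-loop code is shown below. I would really like to vectorize the loop to make it more efficient, but I am not sure how, or whether that is even possible on a multidimensional array. Any suggestions for making this more efficient would be much appreciated. For reference, my_array is a small subset of my array, with two slices. The values in the array are probabilities, and the loop finds the founder with the maximum probability at a specific mouse and marker. The final output is a data frame with mouse names as rows, markers as columns, and the founder as the data. Example code is below.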

    # my_array is indexed as [mouse, founder, marker]
    founder_names <- rownames(my_array[1,,])
    mice_names <- rownames(my_array[,1,])
    marker_names <- colnames(my_array[1,,])

    # Create empty data frame
    probs.df <- data.frame()

    ## Instructions for nested loop

    for(marker in marker_names) {
      for(mouse in mice_names){
        probs.df[mouse, marker] = names(which.max(my_array[mouse,,marker]))
      }
    }

Sample data from dput(my_array):

structure(c(1.86334813592728e-08, 2.02070595143633e-10, 2.1558577630356e-08, 
2.1558577630356e-08, 2.04388477395613e-10, 2.04388477395593e-10, 
2.04388477395613e-10, 2.031707697502e-10, 2.04388477395593e-10, 
2.0317076975018e-10, 0.999999939150967, 1.19701878645413e-10, 
2.94522644878888e-08, 2.94522644878888e-08, 1.20988752710968e-10, 
1.20988752710968e-10, 1.20988752710968e-10, 1.20313358746148e-10, 
1.20988752710968e-10, 1.20313358746148e-10, 2.41632503275453e-12, 
2.53195197455819e-08, 2.89630046322804e-12, 2.89630046322804e-12, 
2.46380958026699e-08, 2.46380958026699e-08, 2.46380958026724e-08, 
2.44127737551662e-08, 2.46380958026699e-08, 2.44127737551638e-08, 
1.08633475857376e-12, 0.999999925628544, 1.30167423493078e-12, 
1.30167423493078e-12, 2.49445205965502e-08, 2.49445205965502e-08, 
2.49445205965527e-08, 2.47171256696929e-08, 2.49445205965502e-08, 
2.47171256696904e-08, 1.84322523200704e-08, 6.29795050516582e-11, 
2.13175870442828e-08, 2.13175870442849e-08, 6.40871335417646e-11, 
6.40871335417646e-11, 6.40871335417646e-11, 6.35035199711943e-11, 
6.40871335417646e-11, 6.3503519971188e-11, 0.999999939821495, 
2.75475678555388e-11, 2.91247770927105e-08, 2.91247770927134e-08, 
2.80325925630150e-11, 2.80325925630123e-11, 2.80325925630150e-11, 
2.77773153893157e-11, 2.80325925630123e-11, 2.77773153893129e-11, 
6.56947829427486e-13, 2.50477863870057e-08, 7.89281798086196e-13, 
7.89281798086277e-13, 2.43639980473783e-08, 2.43639980473783e-08, 
2.43639980473783e-08, 2.41399147887054e-08, 2.43639980473783e-08, 
2.4139914788703e-08, 1.7742262257411e-13, 0.999999926913761, 
2.13166988220277e-13, 2.13166988220277e-13, 2.46686866862984e-08, 
2.46686866862984e-08, 2.46686866863009e-08, 2.44425383948499e-08, 
2.46686866862984e-08, 2.44425383948499e-08), .Dim = c(10L, 4L, 
2L), .Dimnames = list(c("B6HER2", "X100", "X1002", "X1005", "X1006", 
    "X1007", "X1010", "X1011", "X1012", "X1014"), c("AI", "BI", "CI", 
    "DI"), c("UNC6", "JAX00000010")))

the loop finds the founder with max probability value at a specific mouse&marker.

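I would probably...

    # assign the dim names directly to the array; the question's loop
    # indexes it as my_array[mouse, founder, marker]
    names(dimnames(my_array)) <- c("mouse", "founder", "marker")

    # enumerate combos with expand.grid(), not data.frame()
    resdf <- expand.grid(mouse = dimnames(my_array)$mouse,
                         marker = dimnames(my_array)$marker)

    # take maxes within slices
    resdf$founder_max <- dimnames(my_array)$founder[
      c(apply(my_array, c("mouse", "marker"), which.max))
    ]

        mouse      marker founder_max
    1  B6HER2        UNC6          BI
    2    X100        UNC6          DI
    3   X1002        UNC6          BI
    4   X1005        UNC6          BI
    5   X1006        UNC6          DI
    6   X1007        UNC6          DI
    7   X1010        UNC6          DI
    8   X1011        UNC6          DI
    9   X1012        UNC6          DI
    10  X1014        UNC6          DI
    11 B6HER2 JAX00000010          BI
    12   X100 JAX00000010          DI
    13  X1002 JAX00000010          BI
    14  X1005 JAX00000010          BI
    15  X1006 JAX00000010          DI
    16  X1007 JAX00000010          DI
    17  X1010 JAX00000010          DI
    18  X1011 JAX00000010          DI
    19  X1012 JAX00000010          DI
    20  X1014 JAX00000010          DI

Or, using reshape2:

    library(reshape2)

    resdf2 <- melt(apply(my_array, c("mouse", "marker"), function(x)
      dimnames(my_array)$founder[which.max(x)]
    ))

        mouse      marker value
    1  B6HER2        UNC6    BI
    2    X100        UNC6    DI
    3   X1002        UNC6    BI
    4   X1005        UNC6    BI
    5   X1006        UNC6    DI
    6   X1007        UNC6    DI
    7   X1010        UNC6    DI
    8   X1011        UNC6    DI
    9   X1012        UNC6    DI
    10  X1014        UNC6    DI
    11 B6HER2 JAX00000010    BI
    12   X100 JAX00000010    DI
    13  X1002 JAX00000010    BI
    14  X1005 JAX00000010    BI
    15  X1006 JAX00000010    DI
    16  X1007 JAX00000010    DI
    17  X1010 JAX00000010    DI
    18  X1011 JAX00000010    DI
    19  X1012 JAX00000010    DI
    20  X1014 JAX00000010    DI

If you are still running into speed problems, there are alternatives to apply, such as the matrixStats package, or you could write your own fast custom code with Rcpp. There may also be some way to frame your problem so that it uses the fast max.col function from base... though I don't see it immediately.


The final output is a dataframe with mice names as rows, markers as columns, and the founder as the data.

If you really want that format, you can stop right after the apply:

    apply(my_array, c("mouse", "marker"), function(x)
      dimnames(my_array)$founder[which.max(x)]
    )

            marker
    mouse    UNC6 JAX00000010
      B6HER2 "BI" "BI"
      X100   "DI" "DI"
      X1002  "BI" "BI"
      X1005  "BI" "BI"
      X1006  "DI" "DI"
      X1007  "DI" "DI"
      X1010  "DI" "DI"
      X1011  "DI" "DI"
      X1012  "DI" "DI"
      X1014  "DI" "DI"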

This is a matrix, not a data.frame. I don't think it should be converted to a data.frame (other than with melt), but if you need one, you can wrap it in as.data.frame().
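
On the max.col idea mentioned earlier, here is one possible sketch, not a tested solution for the full data. It assumes the array is laid out as [mouse, founder, marker], which is how the question's loop indexes it, and it uses a small made-up array called toy (hypothetical names and values) in place of my_array. The idea is to move the founder dimension to the end with aperm(), flatten to a matrix with one row per mouse/marker pair, and let max.col() pick the winning column for every row in one call. Note that max.col() treats values within a small relative tolerance of the row maximum as ties, so near-ties may resolve differently from which.max().

```r
# Toy stand-in for my_array (hypothetical values), laid out as
# [mouse, founder, marker] to match my_array[mouse, , marker] in the question.
toy <- array(
  c(0.10, 0.70, 0.20,  0.60, 0.10, 0.10,  0.20, 0.10, 0.30,  0.10, 0.10, 0.40,  # marker M1
    0.05, 0.20, 0.90,  0.05, 0.20, 0.03,  0.80, 0.50, 0.03,  0.10, 0.10, 0.04), # marker M2
  dim = c(3, 4, 2),
  dimnames = list(mouse   = c("m1", "m2", "m3"),
                  founder = c("A", "B", "C", "D"),
                  marker  = c("M1", "M2"))
)

n_mouse   <- dim(toy)[1]
n_founder <- dim(toy)[2]
n_marker  <- dim(toy)[3]

# Reorder to [mouse, marker, founder], then flatten so that each row is one
# mouse/marker pair and each column is one founder.
flat <- aperm(toy, c(1, 3, 2))
dim(flat) <- c(n_mouse * n_marker, n_founder)

# One vectorised argmax over all rows; ties.method = "first" mirrors which.max().
idx <- max.col(flat, ties.method = "first")

# Fold the founder labels back into a mouse x marker character matrix.
best <- matrix(dimnames(toy)$founder[idx],
               nrow = n_mouse,
               dimnames = dimnames(toy)[c("mouse", "marker")])

# Wrap in a data frame only if that format is really needed.
best_df <- as.data.frame(best, stringsAsFactors = FALSE)
print(best_df)
```

Whether this beats apply() on the real array is something to measure, since aperm() makes a full copy of the data; on a very large array it may be worth processing one marker slice at a time.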