Sparse matrix from vector

I have a vector of values (vec) and a vector denoting group membership (group):

vec   <- 1:9
group <- rep(1:3, c(2,4,3))

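Suppose we have K groups and N values in total, so both vectors have length N. The goal is to efficiently construct a sparse 'block-diagonal' matrix in which the first column contains the values of group 1, the second column the values of group 2, and so on. The values should not 'overlap', in the sense that each row should contain exactly one value; see the solution below. I need to do this thousands of times with very large K and N, so the following loop-based solution is not efficient enough: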

K     <- length(unique(group))
N     <- length(group)
M     <- matrix(0, N, K)

for (k in 1:K) {
  M[group == k, k] <- vec[group == k]
}

Matrix::Matrix(M, sparse = TRUE)

9 x 3 sparse Matrix of class "dgCMatrix"
           
 [1,] 1 . .
 [2,] 2 . .
 [3,] . 3 .
 [4,] . 4 .
 [5,] . 5 .
 [6,] . 6 .
 [7,] . . 7
 [8,] . . 8
 [9,] . . 9

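For memory reasons, it would be ideal to construct the sparse matrix directly, without the intermediate step of building a dense N x K matrix.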


Edit

For the small example given above, it turns out that the loop-based solution is the fastest:

Unit: microseconds
     expr     min       lq     mean   median       uq      max neval cld
      ben 734.280 771.7000 826.8372 787.5230 805.2710 3185.158   100   b
      CJR 711.187 745.1855 813.9948 766.9960 781.7495 4832.476   100   b
 original 199.714 221.9520 235.4320 227.9395 236.7065  379.757   100  a 

However, when moving to a higher-dimensional example (N = 10,000 and K = 1,000), CJR's solution is the clear winner in terms of speed:

Unit: milliseconds
     expr        min         lq       mean     median         uq        max neval cld
      ben 128.529311 133.308972 140.032070 135.921289 139.272589 289.668852   100  b 
      CJR   1.841474   2.055513   2.261732   2.201557   2.395925   6.330544   100 a  
 original  93.387806 118.348398 171.380301 125.884493 244.421699 365.871433   100   c

Matrix::.bdiag() will allow you to construct a block-diagonal (sparse) matrix directly from a list of matrices:

mm <- lapply(split(vec, group), matrix)
Matrix::.bdiag(mm)

.bdiag(mm) is roughly equivalent to do.call(Matrix::bdiag, mm). ?bdiag says:

The value of ‘bdiag()’ inherits from class ‘CsparseMatrix’, whereas ‘.bdiag()’ returns a ‘TsparseMatrix’.

(The former is the sorted, compressed, column-oriented form; the latter is the triplet form. ?"TsparseMatrix-class" says 'once [a triplet-oriented matrix] is created, however, the matrix is generally coerced to a ‘CsparseMatrix’ for further operations.')

?bdiag also says:

This function has been written and is efficient for the case of relatively few block matrices which are typically sparse themselves.

So this solution will certainly be better than what you have now, but it could probably be improved further.
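To make the distinction quoted from ?bdiag concrete, here is a minimal sketch that builds the triplet-form result with .bdiag() and then coerces it to the compressed column form. The names m_t and m_c are illustrative only, and the coercion via as(., "CsparseMatrix") is one common way to do this rather than something the documentation excerpt above prescribes.

```r
library(Matrix)

vec   <- 1:9
group <- rep(1:3, c(2, 4, 3))

# One single-column matrix per group, in group order
mm <- lapply(split(vec, group), matrix)

m_t <- .bdiag(mm)                 # triplet form (TsparseMatrix)
m_c <- as(m_t, "CsparseMatrix")   # compressed, column-oriented form

class(m_t)
class(m_c)
```

If the matrix is going to be used in further computations, coercing once up front avoids repeating the conversion later.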

vec   <- 1:9
group <- rep(1:3, c(2,4,3))

I would suggest building the row and column indices you need directly, then supplying them to the sparse constructor.

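# Row indices are positions in vec, ordered by group (not the values themselves)
i <- unlist(split(seq_along(vec), group), use.names = FALSE)
# Number of elements in each group
j <- lengths(split(vec, group), use.names = FALSE)
Matrix::sparseMatrix(i = i,
                     j = rep(seq_along(j), j),
                     x = vec[i])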

9 x 3 sparse Matrix of class "dgCMatrix"
           
 [1,] 1 . .
 [2,] 2 . .
 [3,] . 3 .
 [4,] . 4 .
 [5,] . 5 .
 [6,] . 6 .
 [7,] . . 7
 [8,] . . 8
 [9,] . . 9

This also works when the groups are not monotonic:

vec   <- 1:9
group <- c(5:1, 2:5)

9 x 5 sparse Matrix of class "dgCMatrix"
               
 [1,] . . . . 1
 [2,] . . . 2 .
 [3,] . . 3 . .
 [4,] . 4 . . .
 [5,] 5 . . . .
 [6,] . 6 . . .
 [7,] . . 7 . .
 [8,] . . . 8 .
 [9,] . . . . 9

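But when the groups are monotonic, so that each group forms one contiguous run, this can be optimized with rle (as noted in the comments):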

vec   <- 1:9
group <- rep(1:3, c(2,4,3))

j <- rle(group)$lengths
Matrix::sparseMatrix(i=1:length(group),
                     j=rep(1:length(j), j),
                     x=vec)

9 x 3 sparse Matrix of class "dgCMatrix"
           
 [1,] 1 . .
 [2,] 2 . .
 [3,] . 3 .
 [4,] . 4 .
 [5,] . 5 .
 [6,] . 6 .
 [7,] . . 7
 [8,] . . 8
 [9,] . . 9
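As a possible variation on the same idea (not taken from the answer above), the column index can be computed for each element directly from the group labels, which skips the split step and also handles labels that are not the integers 1..K. The sketch below assumes hypothetical character labels and uses factor() to map them to column numbers in sorted label order; the names f, j and S are illustrative.

```r
library(Matrix)

vec   <- c(10, 20, 30, 40, 50, 60)
group <- c("b", "a", "b", "c", "a", "c")   # hypothetical labels, arbitrary order

# Map each label to a column number (columns follow the sorted labels)
f <- factor(group)
j <- as.integer(f)

S <- sparseMatrix(i = seq_along(vec),      # row = position of the value in vec
                  j = j,
                  x = vec,
                  dims = c(length(vec), nlevels(f)),
                  dimnames = list(NULL, levels(f)))
S
```

Passing dims explicitly keeps the number of columns at K even if the last group happens to be empty in a given call.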

You can try the code below, where M is the dense N x K matrix from the question:

> Matrix(`[<-`(M, cbind(seq_along(group), group), vec))
9 x 3 sparse Matrix of class "dgCMatrix"

 [1,] 1 . .
 [2,] 2 . .
 [3,] . 3 .
 [4,] . 4 .
 [5,] . 5 .
 [6,] . 6 .
 [7,] . . 7
 [8,] . . 8
 [9,] . . 9
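The one-liner above relies on matrix indexing: when a two-column matrix is used as the index, each of its rows is read as a (row, column) pair. A minimal sketch of the same assignment written out step by step follows; idx and S are illustrative names, and M is rebuilt here as the zero matrix from the question so that the block is self-contained.

```r
library(Matrix)

vec   <- 1:9
group <- rep(1:3, c(2, 4, 3))

N <- length(group)
K <- length(unique(group))

# Each row of idx is one (row, column) position to fill
idx <- cbind(seq_along(group), group)

M <- matrix(0, N, K)   # dense N x K matrix of zeros
M[idx] <- vec          # vectorised replacement for the loop over groups

S <- Matrix(M, sparse = TRUE)
S
```

Note that this still allocates the dense N x K matrix before converting it, so it removes the loop but not the memory cost mentioned in the question.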

Benchmark

library(Matrix)
library(microbenchmark)

microbenchmark(
  ben = {
    mm <- lapply(split(vec, group), matrix)
    Matrix::.bdiag(mm)
  },
  CJR = {
    i <- unlist(split(vec, group), use.names = F)
    j <- vapply(split(vec, group), length, numeric(1))
    Matrix::sparseMatrix(
      i = i,
      j = rep(1:length(j), j),
      x = vec[i]
    )
  },
  TIC = {
    Matrix(`[<-`(M, cbind(seq_along(group), group), vec))
  },
  check = "equivalent"
)

shows

Unit: microseconds
 expr   min     lq    mean median    uq    max neval
  ben 564.0 599.55 662.640 654.25 686.9 1213.5   100
  CJR 523.1 564.70 643.524 619.65 675.0 1222.6   100
  TIC 165.5 191.90 217.537 208.00 234.7  520.1   100