Joining Scipy sparse matrices without the slow scipy.sparse.vstack

I need to compute a 1xN array inside a loop and stack each new array on top of the previous one. The loop has length M, and the function works something like this:

import numpy as np
import scipy.sparse as sp

mat = np.random.uniform(size=(1,N))
mat_sp = sp.coo_matrix(mat)

for i in range(1,M):

    mat_new = np.random.uniform(size=(1,N))
    mat_sp_new = sp.coo_matrix(mat_new)

    mat_sp = sp.vstack((mat_sp,mat_sp_new))

This produces an MxN matrix. However, doing this with scipy.sparse.vstack is very slow. Compare it with numpy.vstack, or even with simply filling a preallocated matrix, for N = 10000:

In [1]: %timeit sp.vstack((mat_sp,mat_sp)) 
315 µs ± 7.89 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [2]: %timeit np.vstack((mat,mat))
8.63 µs ± 87.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [3]: mat_final = np.zeros((2,10000))   

In [4]: %timeit mat_final[0]=mat; mat_final[1]=mat                               
4.03 µs ± 92.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

My problem is that sometimes I need the final matrix to reach sizes of M = 10^5 and N = 10^6, and NumPy raises a MemoryError.

Is there a way to stack sparse row vectors that is not so slow?

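Stacking repeatedly, as in mat_sp = sp.vstack((mat_sp,mat_sp_new)), is a bad idea, even with np.vstack. It is better to collect all the arrays in a list and call the stack function just once. List append is more efficient when used iteratively, and the stack functions are designed to take a whole list of objects, not just two at a time. List append works in place with object references; array operations return a whole new array each time.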

Let me illustrate:

In [1]: from scipy import sparse 

A function that makes a mostly-zero 1d array:

In [2]: def foo(): 
   ...:     x = np.zeros(10,int) 
   ...:     idx = np.random.randint(0,10,size=4) 
   ...:     x[idx] = idx 
   ...:     return x 
   ...:                                                                                        
In [3]: foo()                                                                                  
Out[3]: array([0, 1, 2, 3, 0, 0, 0, 0, 0, 9])
In [4]: foo()                                                                                  
Out[4]: array([0, 0, 0, 3, 0, 0, 0, 0, 0, 9])

Simple list append:

In [5]: alist = []                                                                             
In [6]: for i in range(4): 
   ...:     alist.append(foo()) 
   ...:                                                                                        
In [7]: alist                                                                                  
Out[7]: 
[array([0, 0, 2, 0, 0, 0, 0, 0, 8, 9]),
 array([0, 0, 0, 3, 0, 5, 0, 0, 0, 9]),
 array([0, 1, 0, 0, 0, 0, 6, 7, 0, 0]),
 array([0, 1, 0, 0, 0, 0, 0, 0, 8, 0])]

A dense array from that list:

In [8]: np.vstack(alist)                                                                       
Out[8]: 
array([[0, 0, 2, 0, 0, 0, 0, 0, 8, 9],
       [0, 0, 0, 3, 0, 5, 0, 0, 0, 9],
       [0, 1, 0, 0, 0, 0, 6, 7, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 0, 8, 0]])

A sparse matrix from that array:

In [9]: M = sparse.coo_matrix(_)                                                               
In [10]: M                                                                                     
Out[10]: 
<4x10 sparse matrix of type '<class 'numpy.int64'>'
    with 11 stored elements in COOrdinate format>
In [11]: M                                                                                     
Out[11]: 
<4x10 sparse matrix of type '<class 'numpy.int64'>'
    with 11 stored elements in COOrdinate format>
In [12]: print(M)                                                                              
  (0, 2)    2
  (0, 8)    8
  (0, 9)    9
  (1, 3)    3
  (1, 5)    5
  (1, 9)    9
  (2, 1)    1
  (2, 6)    6
  (2, 7)    7
  (3, 1)    1
  (3, 8)    8

Alternatively, create a sparse matrix from each array, then join them:

In [13]: alist = []                                                                            
In [14]: for i in range(4): 
    ...:     alist.append(sparse.coo_matrix(foo())) 
    ...:                                                                                       
In [15]: alist                                                                                 
Out[15]: 
[<1x10 sparse matrix of type '<class 'numpy.int64'>'
    with 2 stored elements in COOrdinate format>,
 <1x10 sparse matrix of type '<class 'numpy.int64'>'
    with 3 stored elements in COOrdinate format>,
 <1x10 sparse matrix of type '<class 'numpy.int64'>'
    with 2 stored elements in COOrdinate format>,
 <1x10 sparse matrix of type '<class 'numpy.int64'>'
    with 3 stored elements in COOrdinate format>]
In [16]: M1=sparse.vstack(alist)                                                               
In [17]: M1                                                                                    
Out[17]: 
<4x10 sparse matrix of type '<class 'numpy.longlong'>'
    with 10 stored elements in COOrdinate format>
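
Applied to the loop in the question, the same pattern might look like the sketch below. The sizes N and M, the seed, and the make_row helper are illustrative assumptions chosen to keep the example small; they are not taken from the question. The repeated-vstack version is included only to check that both approaches build the same matrix.

```python
import numpy as np
from scipy import sparse

N, M = 1000, 50  # illustrative sizes, assumed for this sketch
rng = np.random.default_rng(0)

def make_row():
    # Hypothetical stand-in for the real computation:
    # a 1xN row that is mostly zeros, wrapped as a coo matrix.
    row = np.zeros((1, N))
    idx = rng.integers(0, N, size=10)
    row[0, idx] = rng.uniform(size=10)
    return sparse.coo_matrix(row)

# Collect the rows in a plain list; append only stores references.
rows = [make_row() for _ in range(M)]

# Slow pattern: every iteration builds a brand-new, larger matrix.
slow = rows[0]
for r in rows[1:]:
    slow = sparse.vstack((slow, r))

# Preferred pattern: one vstack call on the whole list.
fast = sparse.vstack(rows)

print(fast.shape)
print(np.array_equal(slow.toarray(), fast.toarray()))
```

Timing the two variants with %timeit for realistic M should show the difference; no numbers are quoted here because they depend on the machine and the SciPy version.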

A sparse matrix (coo format) stores its information in 3 arrays (compare with [12]):

In [18]: M.data                                                                                
Out[18]: array([2, 8, 9, 3, 5, 9, 1, 6, 7, 1, 8])
In [19]: M.row                                                                                 
Out[19]: array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3], dtype=int32)
In [20]: M.col                                                                                 
Out[20]: array([2, 8, 9, 3, 5, 9, 1, 6, 7, 1, 8], dtype=int32)

Note how the first 3 values in these arrays correspond to the nonzero elements of:

array([0, 0, 2, 0, 0, 0, 0, 0, 8, 9])

In [21]: m1 =sparse.coo_matrix(np.array([0, 0, 2, 0, 0, 0, 0, 0, 8, 9])) 
In [24]: m1.data                                                                               
Out[24]: array([2, 8, 9])
In [25]: m1.row                                                                                
Out[25]: array([0, 0, 0], dtype=int32)
In [26]: m1.col                                                                                
Out[26]: array([2, 8, 9], dtype=int32)

A common way to construct a sparse matrix is to collect the data, row and col arrays directly, possibly as lists, and then feed them to coo_matrix:

sparse.coo_matrix((data, (row, col)), shape=(M,N))

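sparse.vstack hands the task off to sparse.bmat. Look at the bmat code to see how it collects the data/row/col attributes of each input into composite arrays, and then creates the coo matrix.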
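
A rough sketch of that idea, not the actual bmat source: for a list of coo matrices with the same number of columns, the data and col arrays are concatenated unchanged, and each block's row indices are shifted by the number of rows already stacked. The helper name stack_coo_rows is made up for this illustration.

```python
import numpy as np
from scipy import sparse

def stack_coo_rows(blocks):
    """Vertically stack sparse blocks by concatenating data/row/col arrays."""
    blocks = [sparse.coo_matrix(b) for b in blocks]
    n_cols = blocks[0].shape[1]
    data, row, col = [], [], []
    offset = 0  # number of rows already placed above the current block
    for b in blocks:
        if b.shape[1] != n_cols:
            raise ValueError("all blocks must have the same number of columns")
        data.append(b.data)
        row.append(b.row + offset)  # shift row indices into the final matrix
        col.append(b.col)
        offset += b.shape[0]
    return sparse.coo_matrix(
        (np.concatenate(data), (np.concatenate(row), np.concatenate(col))),
        shape=(offset, n_cols),
    )

blocks = [
    sparse.coo_matrix(np.array([[0, 0, 2, 0, 8]])),
    sparse.coo_matrix(np.array([[1, 0, 0, 0, 0]])),
    sparse.coo_matrix(np.array([[0, 3, 0, 4, 0]])),
]
stacked = stack_coo_rows(blocks)
print(stacked.toarray())
```

The point is that the expensive part is a handful of array concatenations, which is cheap when done once on a full list and wasteful when repeated for every new row.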

Example matrix generator:

In [38]: data, row, col = [], [], []                                                           
In [39]: for i in range(4): 
    ...:     idx = np.random.randint(0,10,size=4) 
    ...:     data.append(idx) 
    ...:     row.append(np.ones(len(idx),int)*i) 
    ...:     col.append(idx) 
    ...: data = np.hstack(data) 
    ...: row = np.hstack(row) 
    ...: col = np.hstack(col) 
    ...: M = sparse.coo_matrix((data, (row, col)), shape=(4,10)) 
    ...:  
    ...:                                                                                       
In [40]: M                                                                                     
Out[40]: 
<4x10 sparse matrix of type '<class 'numpy.int64'>'
    with 16 stored elements in COOrdinate format>
In [41]: data                                                                                  
Out[41]: array([5, 8, 6, 2, 6, 3, 5, 8, 6, 1, 3, 5, 0, 2, 7, 0])
In [42]: row                                                                                   
Out[42]: array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3])
In [43]: M.A                                                                                   
Out[43]: 
array([[0, 0, 2, 0, 0, 5, 6, 0, 8, 0],
       [0, 0, 0, 3, 0, 5, 6, 0, 8, 0],
       [0, 1, 0, 3, 0, 5, 6, 0, 0, 0],
       [0, 0, 2, 0, 0, 0, 0, 7, 0, 0]])
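
For the situation in the question, where each iteration yields a dense 1xN row, the same approach could be sketched as follows. The compute_row helper, the sizes N and M, and the choice of 10 nonzeros per row are assumptions made for this example. Only the nonzero values and their column positions are kept per row, so the full dense MxN array is never allocated.

```python
import numpy as np
from scipy import sparse

N, M = 1000, 50  # illustrative sizes, assumed for this sketch
rng = np.random.default_rng(0)

def compute_row():
    # Hypothetical stand-in for the per-iteration computation:
    # a length-N array with exactly 10 nonzero entries.
    x = np.zeros(N)
    idx = rng.choice(N, size=10, replace=False)
    x[idx] = rng.uniform(0.1, 1.0, size=10)
    return x

data, row, col = [], [], []
for i in range(M):
    x = compute_row()
    nz = np.flatnonzero(x)           # column positions of the nonzeros
    data.append(x[nz])               # the nonzero values themselves
    row.append(np.full(nz.size, i))  # every entry belongs to row i
    col.append(nz)

# Build the sparse matrix once, from the concatenated pieces.
mat = sparse.coo_matrix(
    (np.concatenate(data), (np.concatenate(row), np.concatenate(col))),
    shape=(M, N),
)
mat_csr = mat.tocsr()  # csr is usually more convenient for row slicing and arithmetic
print(mat.shape, mat.nnz)
```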