Joining SciPy sparse matrices without the slow scipy.sparse.vstack
I need to compute an array of size 1xN in a loop and stack each new array on top of the previous one. The loop runs over range(1, M), and the function works roughly like this:
import numpy as np
import scipy.sparse as sp
mat = np.random.uniform(size=(1,N))
mat_sp = sp.coo_matrix(mat)
for i in range(1,M):
    mat_new = np.random.uniform(size=(1,N))
    mat_sp_new = sp.coo_matrix(mat_new)
    mat_sp = sp.vstack((mat_sp,mat_sp_new))
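This produces an MxN matrix. However, doing this with scipy.sparse.vstack is very slow. Compare it with numpy.vstack, or even with plain assignment into a preallocated matrix, for N = 10000: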
In [1]: %timeit sp.vstack((mat_sp,mat_sp))
315 µs ± 7.89 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [2]: %timeit np.vstack((mat,mat))
8.63 µs ± 87.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [3]: mat_final = np.zeros((2,10000))
In [4]: %timeit mat_final[0]=mat; mat_final[1]=mat
4.03 µs ± 92.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
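My problem is that sometimes I need the final matrix to be as large as M = 10^5 and N = 10^6, and NumPy would give a MemoryError.
Is there a way to stack sparse row vectors that is not so slow?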
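Using repeated stacking, as in mat_sp = sp.vstack((mat_sp,mat_sp_new)), is a bad idea, even with np.vstack. It is better to collect all the arrays in a list and call the stack function only once. List append is more efficient when used iteratively, and the stack functions are designed to take a whole list of objects, not just two at a time. List append works in place with object references; array operations return a whole new array each time.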
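The two patterns can be put side by side. This is only a minimal sketch: the sizes n_rows and n_cols and the helper make_row are hypothetical and chosen for illustration, they are not taken from the question.

```python
import numpy as np
from scipy import sparse

def make_row(rng, n_cols, nnz=4):
    """Hypothetical helper: a 1 x n_cols COO row with at most `nnz` nonzeros."""
    x = np.zeros((1, n_cols))
    idx = rng.integers(0, n_cols, size=nnz)
    x[0, idx] = rng.uniform(0.1, 1.0, size=nnz)
    return sparse.coo_matrix(x)

rng = np.random.default_rng(0)
n_rows, n_cols = 200, 1000  # illustrative sizes only
rows = [make_row(rng, n_cols) for _ in range(n_rows)]

# Pattern to avoid: every iteration builds a new matrix that copies
# everything stacked so far.
stacked_slow = rows[0]
for r in rows[1:]:
    stacked_slow = sparse.vstack((stacked_slow, r))

# Preferred pattern: the rows are already in a list, so stack them once.
stacked_fast = sparse.vstack(rows)

# Both approaches produce the same matrix.
same = np.array_equal(stacked_slow.toarray(), stacked_fast.toarray())
```

Timing the two variants with %timeit at realistic sizes is the simplest way to see how the gap grows with the number of rows.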
Let me illustrate:
In [1]: from scipy import sparse
A function to make a sparse 1d array:
In [2]: def foo():
   ...:     x = np.zeros(10,int)
   ...:     idx = np.random.randint(0,10,size=4)
   ...:     x[idx] = idx
   ...:     return x
   ...:
In [3]: foo()
Out[3]: array([0, 1, 2, 3, 0, 0, 0, 0, 0, 9])
In [4]: foo()
Out[4]: array([0, 0, 0, 3, 0, 0, 0, 0, 0, 9])
Simple list append:
In [5]: alist = []
In [6]: for i in range(4):
   ...:     alist.append(foo())
   ...:
In [7]: alist
Out[7]:
[array([0, 0, 2, 0, 0, 0, 0, 0, 8, 9]),
array([0, 0, 0, 3, 0, 5, 0, 0, 0, 9]),
array([0, 1, 0, 0, 0, 0, 6, 7, 0, 0]),
array([0, 1, 0, 0, 0, 0, 0, 0, 8, 0])]
A dense array from that:
In [8]: np.vstack(alist)
Out[8]:
array([[0, 0, 2, 0, 0, 0, 0, 0, 8, 9],
[0, 0, 0, 3, 0, 5, 0, 0, 0, 9],
[0, 1, 0, 0, 0, 0, 6, 7, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 0, 8, 0]])
A sparse matrix from that array:
In [9]: M = sparse.coo_matrix(_)
In [10]: M
Out[10]:
<4x10 sparse matrix of type '<class 'numpy.int64'>'
with 11 stored elements in COOrdinate format>
In [11]: M
Out[11]:
<4x10 sparse matrix of type '<class 'numpy.int64'>'
with 11 stored elements in COOrdinate format>
In [12]: print(M)
(0, 2) 2
(0, 8) 8
(0, 9) 9
(1, 3) 3
(1, 5) 5
(1, 9) 9
(2, 1) 1
(2, 6) 6
(2, 7) 7
(3, 1) 1
(3, 8) 8
The alternative: make a sparse matrix from each array, and then join them:
In [13]: alist = []
In [14]: for i in range(4):
    ...:     alist.append(sparse.coo_matrix(foo()))
    ...:
In [15]: alist
Out[15]:
[<1x10 sparse matrix of type '<class 'numpy.int64'>'
with 2 stored elements in COOrdinate format>,
<1x10 sparse matrix of type '<class 'numpy.int64'>'
with 3 stored elements in COOrdinate format>,
<1x10 sparse matrix of type '<class 'numpy.int64'>'
with 2 stored elements in COOrdinate format>,
<1x10 sparse matrix of type '<class 'numpy.int64'>'
with 3 stored elements in COOrdinate format>]
In [16]: M1=sparse.vstack(alist)
In [17]: M1
Out[17]:
<4x10 sparse matrix of type '<class 'numpy.longlong'>'
with 10 stored elements in COOrdinate format>
A sparse matrix (coo format) stores its information in 3 arrays (compare with [12]):
In [18]: M.data
Out[18]: array([2, 8, 9, 3, 5, 9, 1, 6, 7, 1, 8])
In [19]: M.row
Out[19]: array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3], dtype=int32)
In [20]: M.col
Out[20]: array([2, 8, 9, 3, 5, 9, 1, 6, 7, 1, 8], dtype=int32)
Notice how the first 3 values in these arrays correspond to the nonzero elements of:
array([0, 0, 2, 0, 0, 0, 0, 0, 8, 9])
In [21]: m1 =sparse.coo_matrix(np.array([0, 0, 2, 0, 0, 0, 0, 0, 8, 9]))
In [24]: m1.data
Out[24]: array([2, 8, 9])
In [25]: m1.row
Out[25]: array([0, 0, 0], dtype=int32)
In [26]: m1.col
Out[26]: array([2, 8, 9], dtype=int32)
A common way of constructing a sparse matrix is to collect the data, row, and col arrays directly, possibly as lists, and then feed them to coo_matrix:
sparse.coo_matrix((data, (row, col)), shape=(N,M))
sparse.vstack passes the task on to sparse.bmat. Look at the bmat code to see how it collects the data/row/col attributes of each input into composite arrays and then makes a coo matrix.
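The idea of concatenating the data/row/col attributes can be imitated by hand. The sketch below is a simplified illustration, not the actual bmat source; the function name stack_coo_rows is hypothetical, and it assumes a non-empty list of inputs that all have the same number of columns.

```python
import numpy as np
from scipy import sparse

def stack_coo_rows(mats):
    """Vertically stack sparse matrices by concatenating their COO arrays.

    Hypothetical helper for illustration; assumes `mats` is non-empty.
    """
    mats = [sparse.coo_matrix(m) for m in mats]
    n_cols = mats[0].shape[1]
    data, row, col = [], [], []
    offset = 0  # number of rows already placed above the current block
    for m in mats:
        if m.shape[1] != n_cols:
            raise ValueError("all inputs must have the same number of columns")
        data.append(m.data)
        row.append(m.row + offset)  # shift row indices below earlier blocks
        col.append(m.col)
        offset += m.shape[0]
    return sparse.coo_matrix(
        (np.concatenate(data), (np.concatenate(row), np.concatenate(col))),
        shape=(offset, n_cols),
    )

a = sparse.coo_matrix(np.array([[0, 0, 2, 0, 8]]))
b = sparse.coo_matrix(np.array([[1, 0, 0, 3, 0]]))
stacked = stack_coo_rows([a, b])
```

The only bookkeeping needed is the row offset; data and col are carried over unchanged, which is why a single concatenation at the end is cheap compared with rebuilding the matrix on every iteration.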
An example matrix builder:
In [38]: data, row, col = [], [], []
In [39]: for i in range(4):
    ...:     idx = np.random.randint(0,10,size=4)
    ...:     data.append(idx)
    ...:     row.append(np.ones(len(idx),int)*i)
    ...:     col.append(idx)
    ...: data = np.hstack(data)
    ...: row = np.hstack(row)
    ...: col = np.hstack(col)
    ...: M = sparse.coo_matrix((data, (row, col)), shape=(4,10))
    ...:
    ...:
In [40]: M
Out[40]:
<4x10 sparse matrix of type '<class 'numpy.int64'>'
with 16 stored elements in COOrdinate format>
In [41]: data
Out[41]: array([5, 8, 6, 2, 6, 3, 5, 8, 6, 1, 3, 5, 0, 2, 7, 0])
In [42]: row
Out[42]: array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3])
In [43]: M.A
Out[43]:
array([[0, 0, 2, 0, 0, 5, 6, 0, 8, 0],
[0, 0, 0, 3, 0, 5, 6, 0, 8, 0],
[0, 1, 0, 3, 0, 5, 6, 0, 0, 0],
[0, 0, 2, 0, 0, 0, 0, 7, 0, 0]])
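Applied to the loop in the question, the same approach might look like the sketch below. The function name build_from_rows is hypothetical, and the sketch assumes that each computed row is mostly zeros and that at least one row is produced; a row drawn from np.random.uniform, as in the question's example, is fully dense and would gain nothing from sparse storage.

```python
import numpy as np
from scipy import sparse

def build_from_rows(row_iter, n_cols):
    """Build a CSR matrix from an iterable of dense 1-D rows.

    Hypothetical helper for illustration; assumes at least one row.
    """
    data, row, col = [], [], []
    n_rows = 0
    for i, x in enumerate(row_iter):
        x = np.asarray(x).ravel()
        nz = np.flatnonzero(x)           # column indices of the nonzeros
        data.append(x[nz])
        col.append(nz)
        row.append(np.full(nz.size, i))  # all of these entries sit in row i
        n_rows = i + 1
    M = sparse.coo_matrix(
        (np.concatenate(data), (np.concatenate(row), np.concatenate(col))),
        shape=(n_rows, n_cols),
    )
    return M.tocsr()  # CSR is usually more convenient for later arithmetic

rows = [
    np.array([0, 0, 2, 0, 0, 0, 0, 0, 8, 9]),
    np.array([0, 0, 0, 3, 0, 5, 0, 0, 0, 9]),
]
M = build_from_rows(rows, 10)
```

Only the nonzero values and their indices are kept between iterations, so the dense MxN array is never materialized.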