Stacking two sparse matrices with different dimensions
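I have two sparse matrices (created with sklearn's HashingVectorizer from two sets of features, one matrix per feature set). I want to concatenate them so that I can use them for clustering later. However, I am running into a dimension problem because the two matrices have different numbers of rows.
Here is an example: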
Xa = [-0.57735027 -0.57735027 0.57735027 -0.57735027 -0.57735027 0.57735027
0.5 0.5 -0.5 0.5 0.5 -0.5 0.5
0.5 -0.5 0.5 -0.5 0.5 0.5 -0.5
0.5 0.5 ]
Xb = [-0.57735027 -0.57735027 0.57735027 -0.57735027 0.57735027 0.57735027
-0.5 0.5 0.5 0.5 -0.5 -0.5 0.5
-0.5 -0.5 -0.5 0.5 0.5 ]
Xa and Xb are both of type <class 'scipy.sparse.csr.csr_matrix'>, with shapes Xa.shape = (6, 1048576) and Xb.shape = (5, 1048576). The error I get (I now understand why it happens) is:
X = hstack((Xa, Xb))
File "/usr/local/lib/python2.7/site-packages/scipy/sparse/construct.py", line 464, in hstack
return bmat([blocks], format=format, dtype=dtype)
File "/usr/local/lib/python2.7/site-packages/scipy/sparse/construct.py", line 581, in bmat
'row dimensions' % i)
ValueError: blocks[0,:] has incompatible row dimensions
Is there a way to stack sparse matrices even though their dimensions do not match? Perhaps with some padding?
I have looked at these posts:
Concatenate sparse matrices in Python using SciPy/Numpy
Is there an efficient way of concatenating scipy.sparse matrices?
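You can pad it with an empty sparse matrix.
You want to stack the matrices horizontally, so you need to pad the smaller matrix until it has the same number of rows as the larger one. To do that, stack it vertically with an empty matrix of shape (difference in number of rows, number of columns of the original matrix).
Like this: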
from scipy.sparse import csr_matrix, hstack, vstack

# Create two empty sparse matrices for the demo
Xa = csr_matrix((4, 4))
Xb = csr_matrix((3, 5))

# Difference in the number of rows between Xa and Xb
# (this assumes Xa has at least as many rows as Xb)
diff_n_rows = Xa.shape[0] - Xb.shape[0]

# Pad Xb with an empty block so that it has as many rows as Xa
Xb_new = vstack((Xb, csr_matrix((diff_n_rows, Xb.shape[1]))))

X = hstack((Xa, Xb_new))
X
This results in:
<4x9 sparse matrix of type '<class 'numpy.float64'>'
with 0 stored elements in COOrdinate format>
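The demo above assumes Xa is the taller matrix. As a rough sketch of the same padding idea that works whichever matrix has fewer rows, something like the following could be used. The helper names pad_rows and hstack_padded are hypothetical (they are not part of SciPy), and the small matrices are stand-ins for the (6, 1048576) and (5, 1048576) matrices from the question.

```python
from scipy.sparse import csr_matrix, hstack, vstack


def pad_rows(X, n_rows):
    """Return X with all-zero rows appended so that it has n_rows rows.

    Hypothetical helper for illustration, not a SciPy function.
    """
    missing = n_rows - X.shape[0]
    if missing < 0:
        raise ValueError("X already has more than n_rows rows")
    if missing == 0:
        return X
    # An empty sparse block stores no elements, so the padding is cheap
    padding = csr_matrix((missing, X.shape[1]), dtype=X.dtype)
    return vstack((X, padding), format="csr")


def hstack_padded(Xa, Xb):
    """Horizontally stack Xa and Xb, zero-padding whichever has fewer rows."""
    n_rows = max(Xa.shape[0], Xb.shape[0])
    return hstack((pad_rows(Xa, n_rows), pad_rows(Xb, n_rows)), format="csr")


# Small stand-ins with 6 and 5 rows, as in the question
Xa = csr_matrix([[1.0, 0.0, 2.0]] * 6)
Xb = csr_matrix([[0.0, 3.0, 0.0]] * 5)

X = hstack_padded(Xa, Xb)
print(X.shape)
```

Passing format="csr" keeps the result in CSR form rather than relying on the default output format of hstack and vstack. Note that the padded rows are all zeros in the columns that come from the shorter matrix.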