Overwrite instead of add for duplicate triplets when creating sparse matrix in scipy

Question

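In scipy, when creating a sparse matrix from triplet-format data (row, column, and data arrays), the default behavior is to sum the data values of all duplicates. Can I change this behavior to overwrite (or do nothing) instead?

For example: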

import scipy.sparse as sparse

rows = [0, 0]
cols = [0, 0]
data = [1, 1]
S = sparse.coo_matrix((data, (rows, cols)))

Here, S.todense() is equal to matrix([[2]]), but I would like it to be matrix([[1]]).

In the documentation of sparse.coo_matrix, it says

By default when converting to CSR or CSC format, duplicate (i,j) entries will be summed together. This facilitates efficient construction of finite element matrices and the like.

From that wording, it seems there might be options other than the default.

Answer 1

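I've seen discussion on the scipy GitHub about giving more control over this summing, but I'm not aware of any changes that have been released. As the documentation indicates, there is a long-standing tradition of summing duplicates.

When created, the coo matrix does not sum; it simply assigns your parameters to its attributes: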

In [697]: S = sparse.coo_matrix((data, (rows, cols)))
In [698]: S.data
Out[698]: array([1, 1])
In [699]: S.row
Out[699]: array([0, 0], dtype=int32)
In [700]: S.col
Out[700]: array([0, 0], dtype=int32)

Converting to dense (or to csr/csc) does the sum, but does not change S itself:

In [701]: S.A
Out[701]: array([[2]])
In [702]: S.data
Out[702]: array([1, 1])

You can perform the summation in place:

In [703]: S.sum_duplicates()
In [704]: S.data
Out[704]: array([2], dtype=int32)

I don't know of any way of either removing the duplicates or bypassing the summing. I may look for a related issue.

=================

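S.todok() does the sum in place (that is, it changes S). Looking at its code, I see that it calls self.sum_duplicates. The following copies the values without summing: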

In [727]: dok=sparse.dok_matrix((S.shape),dtype=S.dtype)
In [728]: dok.update(zip(zip(S.row,S.col),S.data))
In [729]: dok
Out[729]: 
<1x1 sparse matrix of type '<class 'numpy.int32'>'
    with 1 stored elements in Dictionary Of Keys format>
In [730]: print(dok)
  (0, 0)    1
In [731]: S
Out[731]: 
<1x1 sparse matrix of type '<class 'numpy.int32'>'
    with 2 stored elements in COOrdinate format>
In [732]: dok.A
Out[732]: array([[1]])

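This is a dictionary update, so the final value is the last of the duplicates. I found elsewhere that dok.update is a pretty fast way of adding values to a sparse matrix.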
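The same last-value-wins rule can be illustrated with a plain Python dict, which avoids relying on dok.update (newer SciPy releases may not allow calling it directly; treat that as an assumption and check your version). This is only a minimal sketch: the sample triplets and the names entries, r_idx, and c_idx are made up for illustration.

```python
import scipy.sparse as sparse

# Hypothetical sample triplets; position (0, 0) appears twice.
rows = [0, 0, 1]
cols = [0, 0, 1]
data = [1, 5, 7]

# A plain dict keyed by (row, col): a later assignment overwrites an earlier one.
entries = {}
for r, c, v in zip(rows, cols, data):
    entries[(r, c)] = v

# Unpack the surviving keys back into row and column index sequences.
r_idx, c_idx = zip(*entries.keys())
S = sparse.coo_matrix((list(entries.values()), (r_idx, c_idx)), shape=(2, 2))
print(S.toarray())
```

Replacing the assignment with entries.setdefault((r, c), v) would keep the first value for each position instead, which matches the "do nothing" variant asked about in the question.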

tocsr inherently does the sum, and tolil uses tocsr, so this todok approach is probably the simplest.
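For large inputs, a vectorized alternative is to drop the duplicates with NumPy before the matrix is built, so that nothing is left to sum. The sketch below is not a scipy feature; coo_overwrite_duplicates is a hypothetical helper name, and it assumes the shape is known up front and that the last value given for a position should win.

```python
import numpy as np
import scipy.sparse as sparse

def coo_overwrite_duplicates(data, rows, cols, shape):
    """Build a COO matrix in which the last value given for each (row, col) wins."""
    data = np.asarray(data)
    rows = np.asarray(rows, dtype=np.int64)
    cols = np.asarray(cols, dtype=np.int64)
    # Flatten each (row, col) pair into a single integer key.
    keys = rows * shape[1] + cols
    # np.unique reports the FIRST occurrence of each key, so search the
    # reversed array to locate the LAST occurrence in the original order.
    _, first_in_reversed = np.unique(keys[::-1], return_index=True)
    keep = len(keys) - 1 - first_in_reversed
    return sparse.coo_matrix((data[keep], (rows[keep], cols[keep])), shape=shape)

# Position (0, 0) is given twice; the later value 5 replaces the earlier 1.
S = coo_overwrite_duplicates([1, 5, 7], [0, 0, 1], [0, 0, 1], shape=(2, 2))
print(S.toarray())
```

Because the result holds no duplicate coordinates, a later conversion to csr or csc has nothing to add together. Calling np.unique on keys directly (without reversing) would keep the first value for each position instead.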

Answer 2

If you only need the value 1:

S.sum_duplicates()
S.data[:]=1
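As a self-contained sketch of the two lines above (the sample triplets and the explicit shape are assumptions added for illustration), the duplicates are first merged in place and every stored value is then reset to 1:

```python
import scipy.sparse as sparse

# Hypothetical sample triplets; position (0, 0) appears twice.
rows = [0, 0, 1]
cols = [0, 0, 1]
data = [1, 1, 1]

S = sparse.coo_matrix((data, (rows, cols)), shape=(2, 2))
S.sum_duplicates()  # merges the two (0, 0) entries in place, giving 2 there
S.data[:] = 1       # reset every stored value to 1
print(S.toarray())
```

This only fits the case where every entry should end up as the same constant; it does not recover distinct original values.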
