Memory error when using += 1 on a large matrix

Question

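I am trying to write a collapsed Gibbs sampler in Python and am running into memory issues when creating the initial values for one of my matrices. I am fairly new to Python, so below is an outline of what I am doing, with explanations. I receive the MemoryError at the fourth step.

My goal is to:

Create a T,M matrix of zeros (plus an alpha value), where T is some small number, e.g. 2:6, and M can be very large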

import numpy as np
import pandas as pd
M = 500
N = 10000
T = 6
alpha = .3
NZM = np.zeros((T,M), dtype = np.float64) + alpha

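Create an M,N matrix of numbers drawn from a multinomial distribution over T topics, which looks like the following.

# One multinomial trial per cell; the column holding the 1 is the topic label.
Z = np.where(np.random.multinomial(1, [1./T]*T, size=M*N) == 1)[1].reshape(M, N)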
Z

array([[1, 3, 0, ..., 5, 3, 1],
       [3, 5, 0, ..., 5, 1, 2],
       [4, 5, 4, ..., 1, 3, 5],
       ..., 
       [1, 2, 1, ..., 0, 3, 4],
       [0, 5, 2, ..., 2, 5, 0],
       [2, 3, 2, ..., 4, 1, 5]])
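
As a side note, because every topic has the same probability here, one multinomial trial per cell amounts to drawing a uniform integer in [0, T). A minimal sketch of that shortcut, assuming NumPy's Generator API (np.random.default_rng) is available; the seed, the name Z_demo, and the small sizes are arbitrary choices for illustration, not the values from the question:

```python
import numpy as np

M, N, T = 5, 8, 6                # small illustrative sizes
rng = np.random.default_rng(0)   # seed chosen arbitrarily for reproducibility

# One uniform topic label in {0, ..., T-1} per cell of the M-by-N matrix.
Z_demo = rng.integers(0, T, size=(M, N))
print(Z_demo.shape)              # (5, 8)
```

This avoids building the intermediate (M*N, T) array of one-hot rows that the multinomial call produces.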

Create an index by using .reshape(M*N)

Z_index = Z.reshape(M*N) 

array([1, 3, 0, ..., 4, 1, 5])

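This step is where I receive the error. I use Z_index to add one to each row of NZM that appears as a value in Z. However, option 1 below is very slow, while option 2 raises a memory error.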

# Option 1
for m in xrange(M):
    NZM[Z_index,m] += 1

# Option 2
NZM[Z_index,:] += 1  



---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-88-087ab1ede05d> in <module>()
      2 # a memory error
      3 
----> 4 NZM[Z_index,:] += 1


MemoryError:

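I want to add one to a row of this array each time that row appears in Z_index. Is there a quick and efficient way to do this that I am unaware of? Thank you for taking the time to read this.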
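
For context, a rough estimate of why Option 2 runs out of memory: indexing with an integer array makes NumPy build a temporary with one row per entry of Z_index, so NZM[Z_index,:] has shape (M*N, M) rather than (T, M). The arithmetic below uses the sizes from the question and allocates nothing; the variable names are only for illustration:

```python
import numpy as np

M, N, T = 500, 10000, 6

rows = M * N                               # one temporary row per entry of Z_index
itemsize = np.dtype(np.float64).itemsize   # 8 bytes per element
temp_bytes = rows * M * itemsize           # temporary of shape (M*N, M)

print(temp_bytes)                          # 20000000000
print(temp_bytes / 1e9)                    # 20.0 (GB)
print(T * M * itemsize)                    # 24000, the size of NZM itself
```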

Answer 1

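My question is a duplicate of the question here, but it arose from what I believe is a distinct query, and people searching for errors caused by a large number of repeated indices will find it more easily.

A simple sanity check showed that this was not doing what I thought it was. I had assumed that, given an index containing the same row several times, += would add one to that row each time it appeared in the index.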

import numpy as np
import pandas as pd

NWZ = np.zeros((5,5), dtype=np.float64) + 1

index = np.repeat([0,3], [1, 3], axis=0)

index

array([0, 3, 3, 3])

NWZ[index,:] += 1

NWZ

array([[ 2.,  2.,  2.,  2.,  2.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 2.,  2.,  2.,  2.,  2.],
       [ 1.,  1.,  1.,  1.,  1.]])

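We can see that this is not the case: giving += several instances of the same row only adds one to the original row. Because += performs an 'in place' operation, I had assumed this operation would return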

array([[ 2.,  2.,  2.,  2.,  2.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 4.,  4.,  4.,  4.,  4.],
       [ 1.,  1.,  1.,  1.,  1.]])

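However, by calling .__iadd__(1) explicitly (on an NWZ that is again all ones), we see that the addition is not accumulated as the index is traversed: the indexed result has one row per index entry, and each of those rows is incremented only once.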

NWZ[index,:].__iadd__(1)

array([[ 2.,  2.,  2.,  2.,  2.],
       [ 2.,  2.,  2.,  2.,  2.],
       [ 2.,  2.,  2.,  2.,  2.],
       [ 2.,  2.,  2.,  2.,  2.]])

You can go here for an intuitive explanation of why this does not (and, users assert, should not) happen.
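
One way to see what happens, sketched on the small example: with an integer-array index, the += behaves roughly like read, add, write back. The read makes a copy with one row per index entry, and the write-back stores the same values into row 3 three times. NumPy also provides np.add.at, which applies the addition once per index entry without that buffering. The names a, b, and tmp below are only for illustration:

```python
import numpy as np

index = np.array([0, 3, 3, 3])

# What `a[index, :] += 1` amounts to for an integer-array index.
a = np.ones((5, 5))
tmp = a[index, :]       # copy of shape (4, 5): rows 0, 3, 3, 3
tmp += 1                # every row of the copy becomes 2
a[index, :] = tmp       # row 3 is overwritten three times with the same 2s
print(a[:, 0])          # [2. 1. 1. 2. 1.]

# Unbuffered version: repeated indices accumulate.
b = np.ones((5, 5))
np.add.at(b, index, 1)  # index selects rows along the first axis
print(b[:, 0])          # [2. 1. 1. 4. 1.]
```

np.add.at still visits every index entry, so on a very large index array it can be slow compared with counting the repeats first.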

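An alternative solution to my problem is to first build a frequency table of the number of times each row n appears in my repeated index. Then, because I am only doing addition, I add those frequencies to their corresponding rows.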

from scipy.stats import itemfreq

index_counts = itemfreq(index)

N = len(index_counts[:,1])
NWZ[index_counts[:,0].astype(int),:] += index_counts[:,1].reshape(N,1)
NWZ

array([[ 2.,  2.,  2.,  2.,  2.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 4.,  4.,  4.,  4.,  4.],
       [ 1.,  1.,  1.,  1.,  1.]])
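
The same frequency table can be built with NumPy alone, which may be useful because itemfreq has been removed from recent SciPy releases. A minimal sketch on the same small example, assuming np.unique with return_counts=True (or np.bincount) as the replacement; the names rows, counts, and NWZ2 are only for illustration:

```python
import numpy as np

NWZ = np.ones((5, 5), dtype=np.float64)
index = np.repeat([0, 3], [1, 3])

# Distinct row numbers and how often each one occurs in the index.
rows, counts = np.unique(index, return_counts=True)

# rows has no duplicates, so a plain += is safe here.
NWZ[rows, :] += counts[:, np.newaxis]
print(NWZ[:, 0])                   # [2. 1. 1. 4. 1.]

# np.bincount gives a count for every row, including rows that never occur.
NWZ2 = np.ones((5, 5), dtype=np.float64)
NWZ2 += np.bincount(index, minlength=NWZ2.shape[0])[:, np.newaxis]
print(np.array_equal(NWZ, NWZ2))   # True
```

For the sizes in the question, Z_index has M*N entries but at most T distinct values, so either version touches at most T rows of NZM and never builds the (M*N, M) temporary.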
