Using memmap files for batch processing

Question

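I have a huge dataset on which I want to run PCA. I am limited by RAM and by the computational cost of PCA, so I switched to iterative PCA (IncrementalPCA).

Dataset size: (140000, 3504)

The documentation states: "This algorithm has constant memory complexity, on the order of batch_size, enabling use of np.memmap files without loading the entire file into memory."

This is really good, but I am unsure how to take advantage of it.

I tried loading a memmap, hoping it would be accessed in chunks, but my RAM blew. My code below ends up using a lot of RAM: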

ut = np.memmap('my_array.mmap', dtype=np.float16, mode='w+', shape=(140000,3504))
clf=IncrementalPCA(copy=False)
X_train=clf.fit_transform(ut)

When I say "my RAM blew", the traceback I see is:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\sklearn\base.py", line 433, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "C:\Python27\lib\site-packages\sklearn\decomposition\incremental_pca.py", line 171, in fit
    X = check_array(X, dtype=np.float)
  File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 347, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
MemoryError

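How can I improve on this, by reducing the batch size, without compromising accuracy?

My ideas for diagnosing this:

I looked at the sklearn source code, and in the fit() function I can see the following. It makes sense to me, but I am still unsure what is going wrong in my case.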

for batch in gen_batches(n_samples, self.batch_size_):
    self.partial_fit(X[batch])
return self
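To make the loop above concrete, here is a minimal sketch of what gen_batches yields. The sizes (10 samples, batches of 4) are made up for illustration; each item is a slice that can be used as X[batch].

```python
from sklearn.utils import gen_batches

# 10 samples split into batches of 4; the last batch holds the remainder.
batches = list(gen_batches(10, 4))
print(batches)
```

Indexing a memmap with one of these slices reads only that range of rows, which is why the loop itself is memory-friendly when it is actually reached.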

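Edit: In the worst case I will have to write my own code for iterative PCA that batch-processes by reading and closing .npy files. But that would defeat the purpose of taking advantage of the hack that is already there.

Edit 2: If I could somehow delete a batch of the memmap file once it has been processed, that would make a lot of sense.

Edit 3: Ideally, if IncrementalPCA.fit() is just using batches, it should not crash my RAM. I am posting the whole code, just to make sure that I am not making a mistake in flushing the memmap completely to disk beforehand.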

temp_train_data=X_train[1000:]
temp_labels=y[1000:] 
out = np.empty((200001, 3504), np.int64)
for index,row in enumerate(temp_train_data):
    actual_index=index+1000
    data=X_train[actual_index-1000:actual_index+1].ravel()
    __,cd_i=pywt.dwt(data,'haar')
    out[index] = cd_i
out.flush()
pca_obj=IncrementalPCA()
clf = pca_obj.fit(out)

Surprisingly, I noticed that out.flush() does not free my memory. Is there a way to use del out to free the memory completely and then pass a pointer to the file to IncrementalPCA.fit()?

Answer 1

Does the following alone trigger the crash?

X_train_mmap = np.memmap('my_array.mmap', dtype=np.float16,
                         mode='w+', shape=(n_samples, n_features))
clf = IncrementalPCA(n_components=50).fit(X_train_mmap)

If not, then you can use that model to transform your data (that is, project it iteratively) into a smaller array, batch by batch:

from sklearn.utils import gen_batches

X_projected_mmap = np.memmap('my_result_array.mmap', dtype=np.float16,
                             mode='w+', shape=(n_samples, clf.n_components))
# Outside the estimator there is no `self`; use the fitted model's batch size.
for batch in gen_batches(n_samples, clf.batch_size_):
    X_batch_projected = clf.transform(X_train_mmap[batch])
    X_projected_mmap[batch] = X_batch_projected

I have not tested that code, but I hope you get the idea.

Answer 2

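You are hitting a problem with sklearn in a 32-bit environment. I presume you are using np.float16 because you are in a 32-bit environment and need it in order to create the memmap object without numpy throwing errors.

In a 64-bit environment (tested with 64-bit Python 3.3 on Windows), your code just works out of the box. So, if you have a 64-bit computer available, install 64-bit Python along with 64-bit numpy, scipy and scikit-learn, and you are good to go.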

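Unfortunately, if you cannot do this, there is no easy fix. I have raised an issue on GitHub, but it is not easy to patch. The fundamental problem is that, within the library, an array of type float16 triggers a copy of the whole array into memory. The details are below.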
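To put rough numbers on that copy, here is a back-of-the-envelope calculation using the array shape from the question. The 2 GiB figure used for comparison is an assumption about the default user address space of a 32-bit Windows process, not something stated in the question.

```python
n_samples, n_features = 140000, 3504
n_elements = n_samples * n_features

bytes_float16 = n_elements * 2  # np.float16 stores 2 bytes per value
bytes_float64 = n_elements * 8  # np.float64 stores 8 bytes per value

gib = 1024 ** 3
print("float16 memmap: %.2f GiB" % (bytes_float16 / gib))
print("float64 copy:   %.2f GiB" % (bytes_float64 / gib))
# Assumed limit: about 2 GiB of address space for a 32-bit process.
print("copy exceeds 2 GiB:", bytes_float64 > 2 * gib)
```

The float16 file is just under 1 GiB, while an in-memory float64 copy is four times that size, so the upcast alone is enough to exhaust a 32-bit process under this assumption.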

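So, I hope you have access to a 64-bit environment with plenty of RAM. If not, you will have to split up your array yourself and batch-process it, which is a rather larger task...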
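As a rough idea of what splitting the array yourself might look like, here is an untested-at-scale sketch that calls partial_fit on one slice of the memmap at a time, so that only a single batch is ever upcast to float64. The helper name fit_in_batches, the file name and the small demo shape are all made up for illustration; they stand in for the real (140000, 3504) array.

```python
import os
import tempfile

import numpy as np
from sklearn.decomposition import IncrementalPCA


def fit_in_batches(X, n_components, batch_size):
    """Hypothetical helper: feed X to IncrementalPCA slice by slice."""
    ipca = IncrementalPCA(n_components=n_components)
    n_samples = X.shape[0]
    for start in range(0, n_samples, batch_size):
        stop = min(start + batch_size, n_samples)
        if stop - start < n_components:
            # partial_fit can reject a batch with fewer rows than
            # n_components, so a too-short tail is skipped here.
            break
        # Only this slice is read from disk and converted to float64.
        batch = np.asarray(X[start:stop], dtype=np.float64)
        ipca.partial_fit(batch)
    return ipca


# Small stand-in array backed by a file in a temporary directory.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'demo_array.mmap')
X_demo = np.memmap(path, dtype=np.float16, mode='w+', shape=(200, 20))
X_demo[:] = np.random.RandomState(0).rand(200, 20)
X_demo.flush()

model = fit_in_batches(X_demo, n_components=5, batch_size=50)
print(model.components_.shape)
```

Calling partial_fit directly sidesteps the full-array check_array call in fit() that appears in the traceback. Whether the per-batch work then fits in a 32-bit process still depends on the batch size and the number of features.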

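N.B. It is really good to see you going to the source to diagnose your problem :) However, if you look at the line where the code fails (from the traceback), you will see that the for batch in gen_batches code you found is never reached.

Detailed diagnosis:

The actual error generated by the OP's code: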

import numpy as np
from sklearn.decomposition import IncrementalPCA

ut = np.memmap('my_array.mmap', dtype=np.float16, mode='w+', shape=(140000,3504))
clf=IncrementalPCA(copy=False)
X_train=clf.fit_transform(ut)

is

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\sklearn\base.py", line 433, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "C:\Python27\lib\site-packages\sklearn\decomposition\incremental_pca.py", line 171, in fit
    X = check_array(X, dtype=np.float)
  File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 347, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
MemoryError

The call to check_array uses dtype=np.float, but the original array has dtype=np.float16. Even though the check_array() function defaults to copy=False and passes this to np.array(), this is ignored (as per the docs) in order to satisfy the requirement that the dtype is different; therefore np.array makes a copy.
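The copy-on-dtype-change behaviour can be seen on a small array. This sketch uses np.asarray, which copies only when needed, in place of np.array(..., copy=False), because the meaning of copy=False changed in newer NumPy versions. np.float in the traceback was an alias for Python's float, i.e. 64-bit, so np.float64 is used here.

```python
import numpy as np

small = np.zeros((1000, 10), dtype=np.float16)

# Same dtype requested: no conversion is needed, so the buffer is shared.
same = np.asarray(small, dtype=np.float16)
# Different dtype requested: NumPy has to allocate a new float64 array.
upcast = np.asarray(small, dtype=np.float64)

print(np.shares_memory(small, same))    # buffer reused
print(np.shares_memory(small, upcast))  # separate buffer
print(upcast.nbytes // small.nbytes)    # size ratio of the copy
```

With a memmap as input, the upcast array is an ordinary in-memory array, which is where the MemoryError comes from.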

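This could be solved in the IncrementalPCA code by ensuring that the dtype is preserved for arrays with dtype in (np.float16, np.float32, np.float64). However, when I tried that patch, it just pushed the MemoryError further along the chain of execution.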
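The dtype-preserving idea can be sketched as a small standalone function. This is not the actual patch that was tried; as_float_preserving is a hypothetical helper that only illustrates the check.

```python
import numpy as np

FLOAT_DTYPES = (np.float16, np.float32, np.float64)


def as_float_preserving(X):
    """Hypothetical helper: keep float dtypes, cast anything else to float64."""
    if X.dtype in FLOAT_DTYPES:
        return X  # no conversion, hence no copy
    return X.astype(np.float64)


x16 = np.zeros((4, 3), dtype=np.float16)
ints = np.arange(6).reshape(2, 3)

kept = as_float_preserving(x16)
cast = as_float_preserving(ints)
print(kept.dtype, cast.dtype)
```

A check like this avoids the first copy only; any later step that requires a different dtype will still convert the data.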

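The same copying problem occurs when the code calls linalg.svd() from the main scipy code, and this time the error occurs during a call to gesdd(), a wrapped native function from LAPACK. Thus, I do not think there is a way to patch this (at least not an easy one; at a minimum it would require altering code in core scipy).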
