How to write an ndarray to a .npy file iteratively in batches

Question

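I am generating a large dataset for a machine learning application, stored as a numpy array of shape (N, X, Y). Here N is the number of samples, X is the input of one sample and Y is the target of one sample. I want to save this array in .npy format. I have many samples (N is very large), so the final dataset will be roughly 10+ GB. That means I cannot create the whole dataset first and then save it, because it would not fit in memory.

Is it possible to write batches of n samples to this file iteratively? For example, I would like to append batches of 256 samples, each of shape (256, X, Y), to the file one at a time.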

Answer 1

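I found that this can be done with ndarray.tofile and np.fromfile. Note that the code below still assumes the whole array is in memory, but you can of course change it so that the batches are generated on the fly.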

import numpy as np

N = 1000
X = 10
Y = 1
my_data = np.random.random((N, X, Y))
print(my_data[700, :, :])

batch_size = 10

# Append the raw bytes of each batch to the file, one batch at a time.
with open('test.dat', mode='wb') as f:
    for i in range(0, N, batch_size):
        batch = my_data[i:i + batch_size, :, :]
        batch.tofile(f)

# The raw file has no header, so dtype and shape must be supplied when reading.
x = np.fromfile('test.dat', dtype=my_data.dtype)
x = np.reshape(x, (N, X, Y))
print(x[700, :, :])

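As @hpaulj noted, a file written this way cannot be loaded with np.load. It contains only the raw array bytes, without the .npy header that records shape and dtype.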
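
Because the raw file stores only bytes, the reader has to supply the dtype and shape itself. The following is a minimal sketch of that idea, not part of the original answer: it assumes batches come from a hypothetical generator `make_batches`, that X, Y and the dtype are known to the reader, and it uses `np.memmap` (instead of `np.fromfile`) so the data is mapped lazily rather than read into memory at once.

```python
import os
import tempfile

import numpy as np

X, Y = 10, 1
DTYPE = np.float64


def make_batches(n_batches, batch_size, seed=0):
    """Hypothetical batch source: yields arrays of shape (batch_size, X, Y)."""
    rng = np.random.default_rng(seed)
    for _ in range(n_batches):
        yield rng.random((batch_size, X, Y)).astype(DTYPE)


def write_raw(path, batches):
    """Append every batch to a raw binary file; return the number of samples."""
    n_samples = 0
    with open(path, "wb") as f:
        for batch in batches:
            np.ascontiguousarray(batch, dtype=DTYPE).tofile(f)
            n_samples += len(batch)
    return n_samples


def open_raw(path):
    """Map the raw file read-only; N is inferred from the file size."""
    sample_bytes = X * Y * np.dtype(DTYPE).itemsize
    n_samples = os.path.getsize(path) // sample_bytes
    return np.memmap(path, dtype=DTYPE, mode="r", shape=(n_samples, X, Y))


path = os.path.join(tempfile.mkdtemp(), "test.dat")
n_written = write_raw(path, make_batches(n_batches=4, batch_size=256))
data = open_raw(path)
print(data.shape)  # (1024, 10, 1)
```

Only one batch is held in memory while writing, and slicing `data` later reads just the requested samples from disk.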

Answer 2

Here is a solution based on numpy's implementation of save. It writes a standard npy file that includes the shape and dtype information:

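import numpy as np
import numpy.lib as npl

a = np.random.random((30, 3, 2))
a1 = a[:10]
a2 = a[10:]

filename = 'out.npy'
with open(filename, 'wb+') as f:
    # Write a provisional header that describes only the first batch.
    header = npl.format.header_data_from_array_1_0(a1)
    npl.format.write_array_header_1_0(f, header)
    data_start = f.tell()
    a1.tofile(f)
    a2.tofile(f)
    # Go back and rewrite the header with the final number of samples.
    f.seek(0)
    header['shape'] = (len(a1) + len(a2), *header['shape'][1:])
    npl.format.write_array_header_1_0(f, header)
    # The new header must occupy exactly the same bytes as the old one,
    # otherwise it would overwrite the start of the data.
    assert f.tell() == data_start

assert (np.load(filename) == a).all()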

This works for C_CONTIGUOUS arrays that do not contain Python objects.
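
The same header-rewriting idea can be wrapped so that it accepts any iterable of batches. This is only a sketch under stated assumptions: `write_npy_in_batches` is a hypothetical helper name, every batch is assumed to share the dtype and trailing dimensions of the first one, and the final check assumes the rewritten header fits in the space of the provisional one (numpy pads the header, but that is not guaranteed for every shape).

```python
import os
import tempfile

import numpy as np
from numpy.lib import format as npy_format


def write_npy_in_batches(path, batches):
    """Stream batches into one .npy file; return the total number of samples."""
    batches = iter(batches)
    first = np.ascontiguousarray(next(batches))
    if first.dtype.hasobject:
        raise TypeError("arrays containing Python objects are not supported")
    header = npy_format.header_data_from_array_1_0(first)
    with open(path, "wb+") as f:
        # Provisional header describing only the first batch.
        npy_format.write_array_header_1_0(f, header)
        data_start = f.tell()
        first.tofile(f)
        total = len(first)
        for batch in batches:
            batch = np.ascontiguousarray(batch, dtype=first.dtype)
            if batch.shape[1:] != first.shape[1:]:
                raise ValueError("all batches must share the trailing dimensions")
            batch.tofile(f)
            total += len(batch)
        # Rewrite the header with the final sample count.
        header["shape"] = (total, *first.shape[1:])
        f.seek(0)
        npy_format.write_array_header_1_0(f, header)
        if f.tell() != data_start:
            raise RuntimeError("header size changed; the data would be corrupted")
    return total


rng = np.random.default_rng(0)
full = rng.random((30, 3, 2))
path = os.path.join(tempfile.mkdtemp(), "out.npy")
n = write_npy_in_batches(path, (full[i:i + 10] for i in range(0, 30, 10)))
loaded = np.load(path, mmap_mode="r")  # mapped lazily instead of read into RAM
```

Passing `mmap_mode="r"` to `np.load` keeps the read side memory-friendly as well, which matters for a file of 10+ GB.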

