Binary storage of floating point values (between 0 and 1) using less than 4 bytes?

Question

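I need to store a very large numpy vector to disk. Right now the vector I am trying to store is about 2.4 billion elements long and the data is float64. Serialized to disk, this takes about 18GB of space.

If I use struct.pack() with float32 (4 bytes per value), I can reduce it to ~9GB. I don't need anywhere near this much precision, and disk space is quickly going to become an issue, since I expect the number of values I need to store could grow by an order of magnitude or two.

I was thinking that if I could access the first 4 significant digits, I could store those values in an int and use only 1 or 2 bytes of space. However, I have no idea how to do this efficiently. Does anyone have any ideas or suggestions?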

Answer 1

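If your data is between 0 and 1 and 16 bits of precision is enough, you can save the data as uint16 (with numpy imported as np):

data16 = (65535 * data).round().astype(np.uint16)

and expand the data again with

data = data16 / 65535.0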
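
For what it's worth, here is a minimal round-trip sketch of the scaling above, assuming the values really do lie in [0, 1]. The helper names quantize and dequantize and the file name are made up for illustration. With 65535 steps, the worst-case rounding error is 0.5 / 65535, roughly 7.6e-6.

```python
import os
import tempfile
import numpy as np

def quantize(data):
    """Map floats in [0, 1] to uint16 codes (2 bytes per value)."""
    return np.round(65535 * np.asarray(data, dtype=np.float64)).astype(np.uint16)

def dequantize(codes):
    """Map uint16 codes back to floats in [0, 1]."""
    return codes / 65535.0

rng = np.random.default_rng(0)
data = rng.random(1000)              # sample values in [0, 1)

codes = quantize(data)
path = os.path.join(tempfile.mkdtemp(), "data16.dat")  # hypothetical file name
codes.tofile(path)                   # raw bytes, no header

# The reader has to know the dtype, since the file carries no metadata.
restored = dequantize(np.fromfile(path, dtype=np.uint16))
max_err = np.abs(restored - data).max()
file_size = os.path.getsize(path)
```

The error is uniform over the whole range, so values very close to 0 keep fewer significant digits than values near 1.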

Answer 2

Use struct.pack() with the f type code to pack them into 4 bytes each.
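
A small sketch of what that looks like, with made-up sample values. The "<" prefix is my own choice here, to force little-endian output without padding. Note that this only gets down to 4 bytes per value.

```python
import struct

values = [0.0, 0.25, 0.1, 1.0]       # sample data, for illustration only

# '<' = little-endian, standard sizes; '4f' = four 32-bit floats
fmt = "<%df" % len(values)
packed = struct.pack(fmt, *values)
unpacked = struct.unpack(fmt, packed)

packed_size = len(packed)            # 4 bytes per value
```

Values such as 0.1 that are not exactly representable in 32 bits come back slightly changed, while 0.25 and 1.0 survive exactly.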

Answer 3

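In general, I would suggest not using float16, but for what it's worth, it's easy to do.

However, older versions of the struct module can't convert to/from 16-bit floats (the e format code was only added in Python 3.6).

Therefore, with numpy you can do something like the following: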

import numpy as np
x = np.linspace(0, 1, 1000)

# Convert to 16-bit floats (2 bytes per value)
x = x.astype(np.float16)
# Open in binary mode; tofile writes the raw bytes
with open('outfile.dat', 'wb') as outfile:
    x.tofile(outfile)

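Note that "outfile.dat" is exactly 2000 bytes: two bytes per item. tofile just writes the raw "packed" binary data to disk. There is no header or the like, and the output is no different from what you would get with the struct module.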
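
Because there is no header, whoever reads the file back has to supply the dtype. A rough sketch of the read side, writing to a temporary directory (my choice, just to keep the example self-contained) and measuring how much precision float16 costs on this data:

```python
import os
import tempfile
import numpy as np

x = np.linspace(0, 1, 1000)
path = os.path.join(tempfile.mkdtemp(), "outfile.dat")
x.astype(np.float16).tofile(path)    # 2 bytes per value, no header

# The file is just raw bytes, so the dtype must be given explicitly.
y = np.fromfile(path, dtype=np.float16)

# Largest absolute difference from the original float64 values
max_err = np.abs(y.astype(np.float64) - x).max()
file_size = os.path.getsize(path)
```

float16 has an 11-bit significand, which works out to roughly three significant decimal digits, so it is worth checking max_err against the precision you actually need.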
