I must compress many similar files, can I exploit the fact they are similar?

Question

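I have a dataset with many different samples (numpy arrays). It is rather impractical to store everything in a single file, so I store many different 'npz' files (numpy arrays compressed in zip).

Now I feel that if I could somehow exploit the fact that all the files are similar to one another, I could achieve a much higher compression factor, meaning a much smaller footprint on my disk.

Is it possible to store a 'zip basis' separately? I mean something that is calculated for all the files together, embodies their statistical features, and is needed for decompression, but is shared between all the files.

I would then have a 'zip basis' file and a separate list of compressed files, each much smaller than the same file zipped alone, and to decompress I would use the shared 'zip basis' every time for each file.

Is it technically possible? Is there something that works like this?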

Answer 1

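tl;dr: it depends on the size of each file and the data within. For example, the characteristics, use cases, and access patterns can differ widely between 234567 files of 100 bytes each and 100 files of 234567 bytes each.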

Now I feel that if I could somehow exploit the fact that all the files are similar to one another I could achieve a much higher compression factor, meaning a much smaller footprint on my disk.

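Possibly. The benefit of shared compression decreases as the individual files get larger.

Regardless, even a single-file container (say, a standard zip) may save a significant amount of effective disk space for a large number of very small files, because it avoids the overhead the file system needs to manage individual files; if nothing else, many file systems must align each file to full blocks (e.g. 512 bytes to 4 KiB). Plus, you get compression for free in a universally supported format.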
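As a minimal sketch of the single-container idea, the snippet below packs many tiny samples into one zip archive using only the standard library and reads a single sample back by name. The helper names (`pack_samples`, `read_sample`) and the sample payloads are made up for illustration. Note that zip compresses each member independently, so this saves file-system overhead but does not by itself exploit similarity between samples.

```python
import io
import zipfile


def pack_samples(samples):
    """Bundle a {name: bytes} mapping into one in-memory zip archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, payload in samples.items():
            archive.writestr(name, payload)
    return buffer.getvalue()


def read_sample(archive_bytes, name):
    """Read one member back without extracting the others."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        return archive.read(name)


# Hypothetical data: 1000 samples of 100 bytes each.
samples = {f"sample_{i:04d}.bin": bytes([i % 256]) * 100 for i in range(1000)}
archive_bytes = pack_samples(samples)
restored = read_sample(archive_bytes, "sample_0042.bin")
```

For real numpy data, `numpy.savez_compressed` writes the same kind of container, with one `.npy` member per array.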

Is it possible to store separately a 'zip basis'? I mean something which is calculated for all the files together and embodies their statistical features and is needed for decompression, but is shared between all the files.

This 'zip basis' is sometimes called a pre-shared dictionary.

I would then have a 'zip basis' file and a separate list of compressed files, each much smaller than the same file zipped alone, and to decompress I would use the shared 'zip basis' every time for each file.

Is it technically possible? Is there something that works like this?

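Yes, it is possible. SDCH (Shared Dictionary Compression for HTTP) is one such implementation, designed for common web files (e.g. HTML/CSS/JavaScript). In certain cases it can achieve much higher compression than standard DEFLATE.

The approach can be emulated with many stream-oriented compression algorithms, in which the compression dictionary is built up from the data already written to the stream. (U = uncompressed, C = compressed.)

To compress: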

[U:shared_dict] + [U:data] -> [C:shared_dict] + [C:data]
^-- "zip basis"                                 ^-- write only this to file
                              ^-- artifact of priming

To decompress:

[C:shared_dict] + [C:data] -> [U:shared_dict] + [U:data]
^-- add this back before decompressing!         ^-- use this
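Python's `zlib` module exposes this priming step directly through the `zdict` argument of `zlib.compressobj` and `zlib.decompressobj`, so the dictionary does not have to be compressed and stripped by hand. The sketch below assumes a hand-written dictionary and sample (both invented for illustration); in practice the dictionary would be assembled from byte sequences common to the real files. Because DEFLATE looks back at most 32 KiB, only the tail of a larger dictionary is usable.

```python
import zlib


def compress_with_dict(data, shared_dict):
    """Compress `data` with the compressor primed by `shared_dict`."""
    compressor = zlib.compressobj(level=9, zdict=shared_dict)
    return compressor.compress(data) + compressor.flush()


def decompress_with_dict(blob, shared_dict):
    """Decompress `blob`; the same dictionary must be supplied again."""
    decompressor = zlib.decompressobj(zdict=shared_dict)
    return decompressor.decompress(blob) + decompressor.flush()


# Hypothetical shared dictionary ("zip basis") and one similar sample.
shared_dict = (
    b'{"sensor": "alpha", "unit": "celsius", "location": "lab-3", '
    b'"readings": [21.5, 21.7, 21.6, 21.8]}'
)
sample = (
    b'{"sensor": "alpha", "unit": "celsius", "location": "lab-3", '
    b'"readings": [22.1, 21.9, 22.0, 22.3]}'
)

primed = compress_with_dict(sample, shared_dict)
plain = zlib.compress(sample, 9)
roundtrip = decompress_with_dict(primed, shared_dict)
print(len(sample), len(plain), len(primed))
```

Only `primed` needs to be stored per file; `shared_dict` is stored once and must be kept unchanged, since the files cannot be decompressed without it.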

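The total space saved depends on many factors, including how useful the initial priming dictionary is and the details of the specific compressor. LZ77-style implementations (such as DEFLATE) are particularly well suited to the approach above, because their sliding window acts as the lookup dictionary.

Alternatively, it may be possible to use domain-specific knowledge and/or encoding to achieve better compression with a specialized scheme. One example is SQL Server's page compression, which exploits data similarities between columns on different rows.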

Answer 2

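A 'zip basis' is interesting but problematic.

You could preprocess the files instead. Take one file as a template, calculate the difference between each file and the template, and then compress the differences.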
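A minimal sketch of the template idea, assuming every sample has the same length as the template and using a byte-wise XOR as the "difference" (an illustrative choice; for numeric arrays, subtracting the template array would be the more natural delta). The template and sample below are synthetic.

```python
import random
import zlib


def xor_delta(sample, template):
    """Byte-wise XOR; applying it twice with the same template is the identity."""
    if len(sample) != len(template):
        raise ValueError("sample and template must have the same length")
    return bytes(a ^ b for a, b in zip(sample, template))


def compress_against_template(sample, template):
    """Compress only the difference between the sample and the template."""
    return zlib.compress(xor_delta(sample, template), 9)


def restore_from_template(blob, template):
    """Decompress the difference and re-apply the template."""
    return xor_delta(zlib.decompress(blob), template)


# Synthetic data: a hard-to-compress template and a sample that differs in 3 bytes.
rng = random.Random(0)
template = bytes(rng.getrandbits(8) for _ in range(4096))
modified = bytearray(template)
for position in (10, 2000, 4000):
    modified[position] ^= 0xFF
sample = bytes(modified)

delta_blob = compress_against_template(sample, template)
direct_blob = zlib.compress(sample, 9)
restored = restore_from_template(delta_blob, template)
print(len(direct_blob), len(delta_blob))
```

The delta is mostly zero bytes when the sample is close to the template, which is what makes it compress well; the gain shrinks as samples drift away from the template.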

python

compression

zip