When and how does git use deltas for storage?
Reading git's documentation, one thing they strongly emphasize is that git stores snapshots, not deltas. Because I had seen a course on Git claiming that Git stores the differences between file versions, I tried the following: I initialized a git repository in an empty folder, created a file lorem.txt containing some lorem ipsum text, staged the file and committed.
Then, using find .git/objects -type f on the command line, I listed what git had saved in the objects folder and, as expected, found a commit object pointing to a tree object, which in turn points to the blob object containing the lorem ipsum text I had saved.
I then modified the lorem ipsum text, adding more content to it, staged this change and committed. Listing the files again, I can now see the new commit object, pointing to a new tree object and a new blob object. Using git cat-file -p 331cf0780688c73be429fa602f9dd99f18b36793 I can see the contents of the new blob: they are exactly the contents of the full lorem.txt file, the old content plus the change.
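Roughly, the sequence of commands was as follows (the blob hash above will of course differ in another repository):

mkdir delta-test && cd delta-test
git init
printf 'lorem ipsum ...\n' > lorem.txt        # first version of the file
git add lorem.txt && git commit -m 'first version'
find .git/objects -type f                     # three loose objects: commit, tree, blob
printf 'more lorem ipsum ...\n' >> lorem.txt  # append to the file
git add lorem.txt && git commit -m 'second version'
find .git/objects -type f                     # three more loose objects
git cat-file -p <hash-of-new-blob>            # prints the complete file, not a diff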
This works as the documentation says: git stores snapshots, not deltas. However, searching the internet I found this SO question. In the accepted answer we see the following:
While that's true and important on the conceptual level, it is NOT true at the storage level.
Git does use deltas for storage.
Not only that, but it's more efficient in it than any other system. Because it does not keep per-file history, when it wants to do delta-compression, it takes each blob, selects some blobs that are likely to be similar (using heuristics that includes the closest approximation of previous version and some others), tries to generate the deltas and picks the smallest one. This way it can (often, depends on the heuristics) take advantage of other similar files or older versions that are more similar than the previous. The "pack window" parameter allows trading performance for delta compression quality. The default (10) generally gives decent results, but when space is limited or to speed up network transfers, git gc --aggressive uses value 250, which makes it run very slow, but provide extra compression for history data.
That is, Git does use deltas for storage. As I understand it, Git doesn't use deltas all the time, but only when it detects that it is necessary. Is that true?
I put a lot of lorem text in the file, so it is 2 MB in size. I thought Git would automatically use deltas when a small change is made to a large text file, but as I said, it doesn't.
When does Git use deltas, and how?
Git only uses deltas in "packfiles". Initially, every git object is written as a separate file (as you have seen). Later, git can pack many objects into a single file, called a "pack file". The pack file is then compressed, and the compression automatically takes advantage of any duplication between the files in the pack (or duplication within a file).
This packing is performed by git repack. You can see it in action by invoking it manually: if you run git repack -ad on a git repository, you should see the disk space used and the number of files under .git/objects drop, as the files are combined into packs and compressed.
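For example, here is roughly what you might observe (a sketch; exact counts and sizes depend on the repository):

git count-objects -v        # "count" = loose objects, "in-pack" = objects stored in packs
du -sh .git/objects
git repack -ad              # pack everything and delete the now-redundant loose objects
git count-objects -v        # the loose count drops to (or near) zero, "in-pack" goes up
du -sh .git/objects         # usually noticeably smaller
find .git/objects -type f   # now mostly just pack-*.pack and pack-*.idx files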
In practice, you usually don't need to run git repack yourself. By default, Git periodically runs git gc, which in turn runs git repack when necessary. So relax, git has you covered :-).
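(If you are curious when that automatic packing kicks in: git gc --auto is triggered once the number of loose objects exceeds the gc.auto threshold, 6700 by default, which you can inspect or change with git config:)

git config --get gc.auto    # no output means the built-in default (6700) is in effect
git config gc.auto 6700     # set it explicitly, or set it to 0 to disable auto-gc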
The excellent "git book" also has a chapter on packfiles with more explanation: http://git-scm.com/book/en/v2/Git-Internals-Packfiles
Git 2.18 (Q2 2018) documents the use of deltas in Documentation/technical/pack-format.
See commit 011b648 (11 May 2018) by Nguyễn Thái Ngọc Duy (pclouds).
(Merged by Junio C Hamano -- gitster -- in commit b577198, 23 May 2018.)
pack-format.txt: more details on pack file format
The current document mentions OBJ_* constants without their actual values. A git developer would know these are from cache.h but that's not very friendly to a person who wants to read this file to implement a pack file parser.
Similarly, the deltified representation is not documented at all (the "document" is basically patch-delta.c). Translate that C code to English with a bit more about what ofs-delta and ref-delta mean.
So the documentation now states:
Object types
Valid object types are:
- OBJ_COMMIT (1)
- OBJ_TREE (2)
- OBJ_BLOB (3)
- OBJ_TAG (4)
- OBJ_OFS_DELTA (6)
- OBJ_REF_DELTA (7)
Type 5 is reserved for future expansion. Type 0 is invalid.
Deltified representation
Conceptually there are only four object types: commit, tree, tag and blob.
However to save space, an object could be stored as a "delta" of another "base" object.
These representations are assigned new types ofs-delta and ref-delta, which is only valid in a pack file.
Both ofs-delta and ref-delta store the "delta" to be applied to another object (called 'base object') to reconstruct the object.
The difference between them is,
- ref-delta directly encodes 20-byte base object name.
- If the base object is in the same pack, ofs-delta encodes the offset of the base object in the pack instead.
The base object could also be deltified if it's in the same pack.
Ref-delta can also refer to an object outside the pack (i.e. the
so-called "thin pack"). When stored on disk however, the pack should
be self contained to avoid cyclic dependency.
The delta data is a sequence of instructions to reconstruct an object
from the base object.
If the base object is deltified, it must be converted to canonical form first. Each instruction appends more and more data to the target object until it's complete.
There are two supported instructions so far:
- one for copy a byte range from the source object and
- one for inserting new data embedded in the instruction itself.
Each instruction has variable length. Instruction type is determined
by the seventh bit of the first octet. The following diagrams follow
the convention in RFC 1951 (Deflate compressed data format).
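You can see these deltified entries in your own repository once it has been packed; for example (a sketch, output varies by repository; the -v listing shows SHA-1, type, size, size-in-pack and offset, plus delta depth and base SHA-1 for deltified objects):

git repack -ad
git verify-pack -v .git/objects/pack/pack-*.idx
# deltified objects carry two extra columns (chain depth and base object),
# and the summary at the end shows how many objects sit at each chain length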
With Git 2.20 (Q4 2018), an issue where malformed or crafted data in a packstream could make our code attempt to read or write past the allocated buffer and abort, instead of reporting an error, has been fixed.
t5303: use printf to generate delta bases
The exact byte count of the delta base file is important.
The test-delta helper will feed it to patch_delta(), which will barf if it doesn't match the size byte given in the delta.
Using "echo" may end up with unexpected line endings on some platforms (e.g., "\r\n" instead of just "\n").
This actually wouldn't cause the test to fail (since we already expect test-delta to complain about these bogus deltas), but would mean that we're not exercising the code we think we are.
Let's use printf instead (which we already trust to give us byte-perfect output when we generate the deltas).
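A quick illustration of the byte-exactness point (a sketch; on a typical Unix shell echo appends "\n", and on some platforms it may produce "\r\n"):

printf 'base' | wc -c   # 4 bytes, exactly what was written
echo 'base'   | wc -c   # 5 bytes (trailing "\n"), possibly 6 if "\r\n" is emitted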
With Git 2.25 (Q1 2020), an inefficient search algorithm made the check that avoids registering the same packfile twice unnecessarily costly in a repository with many pack files; this has been corrected.
See commit ec48540 (27 Nov 2019) by Colin Stolley (ccstolley).
(Merged by Junio C Hamano -- gitster -- in commit 6d831b8, 16 Dec 2019.)
packfile.c: speed up loading lots of packfiles
Signed-off-by: Colin Stolley
Helped-by: Jeff King
When loading packfiles on start-up, we traverse the internal packfile list once per file to avoid reloading packfiles that have already been loaded. This check runs in quadratic time, so for poorly maintained repos with a large number of packfiles, it can be pretty slow.
Add a hashmap containing the packfile names as we load them so that the average runtime cost of checking for already-loaded packs becomes constant.
Add a perf test to p5303 to show speed-up.
The existing p5303 test runtimes are dominated by other factors and do not show an appreciable speed-up.
The new test in p5303 clearly exposes a speed-up in bad cases.
In this test we create 10,000 packfiles and measure the start-up time of git rev-parse, which does little else besides load in the packs.
Here are the numbers for the new p5303 test:
Test HEAD^ HEAD
---------------------------------------------------------------------
5303.12: load 10,000 packs 1.03(0.92+0.10) 0.12(0.02+0.09) -88.3%
[jc: squashed the change to call hashmap in install_packed_git() by peff]
Signed-off-by: Junio C Hamano
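If you want to get a feel for this effect yourself, here is a rough sketch along the lines of the p5303 test (a hypothetical loop, not the actual perf script; it is slow to run and the numbers will differ):

# in a repository with at least one commit
for i in $(seq 1 10000); do
    printf 'blob %d\n' "$i" | git hash-object -w --stdin |
        git pack-objects --quiet .git/objects/pack/pack >/dev/null
done
time git rev-parse HEAD   # start-up cost is dominated by registering the packs before Git 2.25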