Why would fetching specific git commits use more disk space than fetching all?

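If I run git fetch origin and then git checkout <revision> on a series of consecutive commits, I get a relatively small repo directory.

But if I run git fetch origin <revision> and then git checkout FETCH_HEAD on the same series of commits, the directory is relatively bloated. Specifically, there seem to be a bunch of large packfiles.

The behavior appears to be the same whether the commits are all in place at the time of the first fetch or are pushed immediately before each fetch.

The following examples use a public repository, so you can reproduce the behavior.

Why is the directory in Example 2 so much larger?

Example 1 (small):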

mkdir argo-cd
cd argo-cd/
git init
git remote add origin https://github.com/argoproj/argo-cd.git
git fetch origin
git checkout 497e53b0203638409e3083fa2ffac7d8fb3cce14
git fetch origin
git checkout 32be020af0f8bf6438201ee79b4d2b8037c57154
git fetch origin
git checkout 32d33dedcc70d94177384b235891b99d89497273
git fetch origin
git checkout 2e65b42f05bcc1401d1489e751993ec197f6942c
git fetch origin
git checkout b1ff9dbe1e3e3b2520e94eefc77d0322c765cd75
ls .git/objects/pack  # shows two files
du -sh .  # current directory is 96M

Example 2 (large):

cd ..
mkdir argo-cd-fetch
cd argo-cd-fetch/
git init
git remote add origin https://github.com/argoproj/argo-cd.git
git fetch origin 497e53b0203638409e3083fa2ffac7d8fb3cce14
git checkout FETCH_HEAD
git fetch origin 32be020af0f8bf6438201ee79b4d2b8037c57154
git checkout FETCH_HEAD
git fetch origin 32d33dedcc70d94177384b235891b99d89497273
git checkout FETCH_HEAD
git fetch origin 2e65b42f05bcc1401d1489e751993ec197f6942c
git checkout FETCH_HEAD
git fetch origin b1ff9dbe1e3e3b2520e94eefc77d0322c765cd75
git checkout FETCH_HEAD
ls .git/objects/pack  # shows ten files
du -sh .  # current directory is 244M

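Note: I'm using git 2.32.0.

Note: This question was inspired by an apparent bug in Argo CD (https://github.com/argoproj/argo-cd/pull/8897). That's why I'm not just running git gc to clean up the garbage.

Update/clarification:

Below are the full logs for each example. In this run, though, I pushed each commit to my fork immediately before running the next git fetch. So here we know that the initial fetch is not "fetching everything" and leaving the subsequent steps with essentially nothing to do.

Example 1 (small):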

$ mkdir argo-cd-fork
~ $ cd argo-cd-fork/
~/argo-cd-fork $ git init
hint: Using 'master' as the name for the initial branch. This default branch name
hint: is subject to change. To configure the initial branch name to use in all
hint: of your new repositories, which will suppress this warning, call:
hint:
hint:   git config --global init.defaultBranch <name>
hint:
hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
hint: 'development'. The just-created branch can be renamed via this command:
hint:
hint:   git branch -m <name>
Initialized empty Git repository in /Users/mcrenshaw/argo-cd-fork/.git/
~/argo-cd-fork (master|✔) $ git remote add origin https://github.com/crenshaw-dev/argo-cd.git

# Fetch 1

~/argo-cd-fork (master|✔) $ git fetch origin
remote: Enumerating objects: 83781, done.
remote: Counting objects: 100% (89/89), done.
remote: Compressing objects: 100% (62/62), done.
remote: Total 83781 (delta 60), reused 45 (delta 25), pack-reused 83692
Receiving objects: 100% (83781/83781), 60.99 MiB | 22.12 MiB/s, done.
Resolving deltas: 100% (52061/52061), done.
From https://github.com/crenshaw-dev/argo-cd
 * [new branch]          add-chart-field-to-application-yaml              -> origin/add-chart-field-to-application-yaml
... removed a bunch of branches and tags for brevity ...
 * [new tag]             v2.1.4                                           -> v2.1.4
~/argo-cd-fork (master|✔) $ du -sh .
 65M    .
~/argo-cd-fork (master|✔) $ git checkout afb1fe635ff7f5c435c5780ba665c72d5bc3c557
Note: switching to 'afb1fe635ff7f5c435c5780ba665c72d5bc3c557'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at afb1fe635 chore: fix unit test

# Fetch 2

~/argo-cd-fork ((afb1fe63…)|✔) $ git fetch origin
remote: Enumerating objects: 1, done.
remote: Counting objects: 100% (1/1), done.
remote: Total 1 (delta 0), reused 1 (delta 0), pack-reused 0
Unpacking objects: 100% (1/1), 161 bytes | 161.00 KiB/s, done.
From https://github.com/crenshaw-dev/argo-cd
   afb1fe635..f8fe71ab8  master     -> origin/master
~/argo-cd-fork ((afb1fe63…)|✔) $ git checkout f8fe71ab8f38095e296932b73f929bfbaf24f110
Previous HEAD position was afb1fe635 chore: fix unit test
HEAD is now at f8fe71ab8 test

# Fetch 3

~/argo-cd-fork ((f8fe71ab…)|✔) $ git fetch origin
remote: Enumerating objects: 1, done.
remote: Counting objects: 100% (1/1), done.
remote: Total 1 (delta 0), reused 1 (delta 0), pack-reused 0
Unpacking objects: 100% (1/1), 162 bytes | 81.00 KiB/s, done.
From https://github.com/crenshaw-dev/argo-cd
   f8fe71ab8..0363d622c  master     -> origin/master
~/argo-cd-fork ((f8fe71ab…)|✔) $ git checkout 0363d622c391947349689904f6b40209ff3123cd
Previous HEAD position was f8fe71ab8 test
HEAD is now at 0363d622c test

# Fetch 4

~/argo-cd-fork ((0363d622…)|✔) $ git fetch origin
remote: Enumerating objects: 1, done.
remote: Counting objects: 100% (1/1), done.
remote: Total 1 (delta 0), reused 1 (delta 0), pack-reused 0
Unpacking objects: 100% (1/1), 161 bytes | 161.00 KiB/s, done.
From https://github.com/crenshaw-dev/argo-cd
   0363d622c..4115a8c12  master     -> origin/master
~/argo-cd-fork ((0363d622…)|✔) $ git checkout 4115a8c1221751b1586caaf9871a0be12b5ce891
Previous HEAD position was 0363d622c test
HEAD is now at 4115a8c12 test

# Fetch 5

~/argo-cd-fork ((4115a8c1…)|✔) $ git fetch origin
remote: Enumerating objects: 1, done.
remote: Counting objects: 100% (1/1), done.
remote: Total 1 (delta 0), reused 1 (delta 0), pack-reused 0
Unpacking objects: 100% (1/1), 161 bytes | 161.00 KiB/s, done.
From https://github.com/crenshaw-dev/argo-cd
   4115a8c12..8f01aaddb  master     -> origin/master
~/argo-cd-fork ((4115a8c1…)|✔) $ git checkout 8f01aaddbaf4350217dcc84866275493b19308eb
Previous HEAD position was 4115a8c12 test
HEAD is now at 8f01aaddb test

~/argo-cd-fork ((8f01aadd…)|✔) $ du -sh .
 96M    .

Example 2 (large):

 ~/argo-cd-fork ((8f01aadd…)|✔) $ cd ..
 ~ $ mkdir argo-cd-fork-2
 ~ $ cd argo-cd-fork-2
 ~/argo-cd-fork-2 [128]$ git init
 hint: Using 'master' as the name for the initial branch. This default branch name
 hint: is subject to change. To configure the initial branch name to use in all
 hint: of your new repositories, which will suppress this warning, call:
 hint:
 hint:   git config --global init.defaultBranch <name>
 hint:
 hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
 hint: 'development'. The just-created branch can be renamed via this command:
 hint:
 hint:   git branch -m <name>
 Initialized empty Git repository in /Users/mcrenshaw/argo-cd-fork-2/.git/
 ~/argo-cd-fork-2 (master|✔) $ git remote add origin https://github.com/crenshaw-dev/argo-cd.git

# Fetch 1

 ~/argo-cd-fork-2 (master|✔) $ git fetch origin 8f01aaddbaf4350217dcc84866275493b19308eb
 remote: Enumerating objects: 47713, done.
 remote: Counting objects: 100% (4/4), done.
 remote: Compressing objects: 100% (4/4), done.
 remote: Total 47713 (delta 3), reused 1 (delta 0), pack-reused 47709
 Receiving objects: 100% (47713/47713), 40.90 MiB | 26.40 MiB/s, done.
 Resolving deltas: 100% (31970/31970), done.
 From https://github.com/crenshaw-dev/argo-cd
  * branch              8f01aaddbaf4350217dcc84866275493b19308eb -> FETCH_HEAD
 ~/argo-cd-fork-2 (master|✔) $ git checkout FETCH_HEAD
 Note: switching to 'FETCH_HEAD'.

 You are in 'detached HEAD' state. You can look around, make experimental
 changes and commit them, and you can discard any commits you make in this
 state without impacting any branches by switching back to a branch.

 If you want to create a new branch to retain commits you create, you may
 do so (now or later) by using -c with the switch command. Example:

   git switch -c <new-branch-name>

 Or undo this operation with:

   git switch -

 Turn off this advice by setting config variable advice.detachedHead to false

 HEAD is now at 8f01aadd test

# Fetch 2

 ~/argo-cd-fork-2 ((8f01aadd…)|✔) $ git fetch origin 3fad137f5dcd8ebdb504a8b8de0138fb92d76458
 remote: Enumerating objects: 47714, done.
 remote: Counting objects: 100% (5/5), done.
 remote: Compressing objects: 100% (5/5), done.
 remote: Total 47714 (delta 4), reused 1 (delta 0), pack-reused 47709
 Receiving objects: 100% (47714/47714), 40.90 MiB | 19.89 MiB/s, done.
 Resolving deltas: 100% (31971/31971), done.
 From https://github.com/crenshaw-dev/argo-cd
  * branch                3fad137f5dcd8ebdb504a8b8de0138fb92d76458 -> FETCH_HEAD
 ~/argo-cd-fork-2 ((8f01aadd…)|✔) $ git checkout FETCH_HEAD
 Previous HEAD position was 8f01aaddb test
 HEAD is now at 3fad137f5 test

# Fetch 3

 ~/argo-cd-fork-2 ((3fad137f…)|✔) $ git fetch origin a94ab16b0964c2b583f8b923ad5a84b2a6b2b716
 remote: Enumerating objects: 47715, done.
 remote: Counting objects: 100% (6/6), done.
 remote: Compressing objects: 100% (6/6), done.
 remote: Total 47715 (delta 5), reused 1 (delta 0), pack-reused 47709
 Receiving objects: 100% (47715/47715), 40.90 MiB | 5.89 MiB/s, done.
 Resolving deltas: 100% (31972/31972), done.
 From https://github.com/crenshaw-dev/argo-cd
  * branch                a94ab16b0964c2b583f8b923ad5a84b2a6b2b716 -> FETCH_HEAD
 ~/argo-cd-fork-2 ((3fad137f…)|✔) $ git checkout FETCH_HEAD
 Previous HEAD position was 3fad137f5 test
 HEAD is now at a94ab16b0 test

# Fetch 4

 ~/argo-cd-fork-2 ((a94ab16b…)|✔) $ git fetch origin bf651bfc6653b6cf13a522d590a8779fc3b66a77
 remote: Enumerating objects: 47716, done.
 remote: Counting objects: 100% (7/7), done.
 remote: Compressing objects: 100% (7/7), done.
 remote: Total 47716 (delta 6), reused 1 (delta 0), pack-reused 47709
 Receiving objects: 100% (47716/47716), 40.90 MiB | 7.31 MiB/s, done.
 Resolving deltas: 100% (31973/31973), done.
 From https://github.com/crenshaw-dev/argo-cd
  * branch                bf651bfc6653b6cf13a522d590a8779fc3b66a77 -> FETCH_HEAD
 ~/argo-cd-fork-2 ((a94ab16b…)|✔) $ git checkout FETCH_HEAD
 Previous HEAD position was a94ab16b0 test
 HEAD is now at bf651bfc6 test

# Fetch 5

 ~/argo-cd-fork-2 ((bf651bfc…)|✔) $ git fetch origin 81895cf2a3f6e030aef7ddadc390b7a7743af03d
 remote: Enumerating objects: 47717, done.
 remote: Counting objects: 100% (8/8), done.
 remote: Compressing objects: 100% (8/8), done.
 remote: Total 47717 (delta 7), reused 1 (delta 0), pack-reused 47709
 Receiving objects: 100% (47717/47717), 41.00 MiB | 9.17 MiB/s, done.
 Resolving deltas: 100% (32005/32005), done.
 From https://github.com/crenshaw-dev/argo-cd
  * branch                81895cf2a3f6e030aef7ddadc390b7a7743af03d -> FETCH_HEAD
 ~/argo-cd-fork-2 ((bf651bfc…)|✔) $ git checkout FETCH_HEAD
 Previous HEAD position was bf651bfc6 test
 HEAD is now at 81895cf2a test

 ~/argo-cd-fork-2 ((81895cf2…)|✔) $ du -sh .
 242M    .

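Because each fetch results in its own packfile, and one packfile is more efficient than multiple packfiles. A lot more efficient. How?

First, the checkouts are a red herring. They don't affect the size of the .git/ directory.

Second, in the first example only the first git fetch origin does anything. The rest will fetch nothing (unless origin has changed).

Why are multiple packfiles less efficient?

Compression works by finding common long sequences in the data and reducing them to very short sequences. If <div>long block of legal mumbo jumbo</div> appears dozens of times, each occurrence can be replaced with a few bytes. But the original long string still has to be stored. If there is one packfile, it only has to be stored once. If there are multiple packfiles, it has to be stored multiple times. You are, effectively, storing the whole history of changes up to that point in each packfile.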
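The effect can be sketched outside of Git. The script below is an illustration only: it uses gzip as a stand-in for Git's compression (it is not the packfile format), and the shared "block" is made-up data generated with seq. It compares one archive that contains four copies of the block with four archives that each contain their own copy.

```shell
#!/bin/sh
# Illustration only: gzip stands in for Git's compression here.
set -eu
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# A block of text that every "snapshot" shares (hypothetical data).
seq 1 300 > "$tmp/block"

# One archive holding four copies of the block: the repeats become
# short back-references to the first copy.
cat "$tmp/block" "$tmp/block" "$tmp/block" "$tmp/block" > "$tmp/all"
combined=$(gzip -c "$tmp/all" | wc -c | tr -d ' ')

# Four separate archives: each one has to store the block in full.
single=$(gzip -c "$tmp/block" | wc -c | tr -d ' ')
separate=$((single * 4))

echo "one archive:   $combined bytes"
echo "four archives: $separate bytes"
```

The exact byte counts depend on the gzip build, but the four separate archives should come out several times larger than the single one.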

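We can see in the example below that the first packfile is 113M, the second is 161M, the third is 177M, and the one from the final fetch is 209M. The final packfile is roughly the same size as the single garbage-collected packfile.

Why do multiple fetches result in multiple packfiles?

git fetch is very efficient. It only fetches objects you don't already have. Sending individual object files is inefficient, so a smart Git server sends them as a single packfile.

When you do a single git fetch on a fresh repository, Git asks the server for every object. The remote sends it a packfile of every object.

When you do git fetch ABC and then git fetch DEF, Git tells the server "I already have all the objects up to ABC, give me all the objects up to DEF", so the server makes a new packfile of everything from ABC to DEF and sends it.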
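To make "everything from ABC to DEF" concrete, here is a minimal sketch in a throwaway local repository. The repository, file names, and commit messages are hypothetical; git rev-list --objects is used only to list the object sets. It does not show what a particular server will actually put in a pack, which depends on how the negotiation goes.

```shell
#!/bin/sh
set -eu
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
git init -q "$tmp/demo"
cd "$tmp/demo"

# Identity is passed inline so the sketch does not depend on global config.
commit() {
  git -c user.name=demo -c user.email=demo@example.com \
      -c commit.gpgsign=false commit -q -m "$1"
}

echo "first" > a.txt
git add a.txt
commit "ABC"
abc=$(git rev-parse HEAD)

echo "second" > b.txt
git add b.txt
commit "DEF"
def=$(git rev-parse HEAD)

# Every object reachable from DEF: what a fetch into an empty repo needs.
all=$(git rev-list --objects "$def" | wc -l | tr -d ' ')
# Objects reachable from DEF but not from ABC: what is missing once
# the receiver already has ABC.
missing=$(git rev-list --objects "$abc..$def" | wc -l | tr -d ' ')

echo "reachable from DEF: $all"
echo "new since ABC:      $missing"
```

Here DEF reaches six objects (two commits, two trees, two blobs), while only three of them (one commit, one tree, one blob) are new relative to ABC.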

Eventually your repository will do an automatic garbage collection and repack these into a single packfile.
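The consolidation can be observed with git count-objects -v, which reports the number of packs. The sketch below is an assumption-laden stand-in: it builds a throwaway repository and uses git repack -d after each commit to imitate "one new pack per fetch", then runs git gc by hand instead of waiting for automatic garbage collection.

```shell
#!/bin/sh
set -eu
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
git init -q "$tmp/demo"
cd "$tmp/demo"

# Identity is passed inline so the sketch does not depend on global config.
commit() {
  git -c user.name=demo -c user.email=demo@example.com \
      -c commit.gpgsign=false commit -q -m "$1"
}
# Read the "packs:" line from git count-objects -v.
count_packs() { git count-objects -v | awk '/^packs:/ {print $2}'; }

for n in 1 2 3; do
  echo "change $n" > "file$n.txt"
  git add "file$n.txt"
  commit "change $n"
  # Pack only the objects that are still loose; older packs are kept.
  git repack -d -q
done
before=$(count_packs)

# Garbage collection repacks everything into a single pack.
git gc -q
after=$(count_packs)

echo "packs before gc: $before"
echo "packs after gc:  $after"
```

In a real clone, the same count-objects check before and after git gc shows whether the extra packs from repeated fetches have been merged.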


We can reduce the example. I'll use Rails to illustrate because it has clearly defined tags to fetch.

git init
git remote add origin https://github.com/rails/rails.git
git fetch origin
du -sh .git/objects/pack/*
22M .git/objects/pack/pack-ef0a91833c4774a28a21c814a26e04043621512d.idx
209M    .git/objects/pack/pack-ef0a91833c4774a28a21c814a26e04043621512d.pack

And:

git init
git remote add origin https://github.com/rails/rails.git

git fetch origin v5.0.0
du -sh .git/objects/pack/*
13M .git/objects/pack/pack-7be7f8792d634f63a623e50165a11983e7cdaeef.idx
113M    .git/objects/pack/pack-7be7f8792d634f63a623e50165a11983e7cdaeef.pack

git fetch origin v6.0.0
du -sh .git/objects/pack/*
13M .git/objects/pack/pack-7be7f8792d634f63a623e50165a11983e7cdaeef.idx
113M    .git/objects/pack/pack-7be7f8792d634f63a623e50165a11983e7cdaeef.pack
16M .git/objects/pack/pack-c81c5343636211ffcc9ffdfeeb3bb65b9cba75df.idx
161M    .git/objects/pack/pack-c81c5343636211ffcc9ffdfeeb3bb65b9cba75df.pack

git fetch origin v7.0.0
du -sh .git/objects/pack/*
18M .git/objects/pack/pack-2d2066f04670f137265fed0f382ad0d6f0dd9f3e.idx
177M    .git/objects/pack/pack-2d2066f04670f137265fed0f382ad0d6f0dd9f3e.pack
13M .git/objects/pack/pack-7be7f8792d634f63a623e50165a11983e7cdaeef.idx
113M    .git/objects/pack/pack-7be7f8792d634f63a623e50165a11983e7cdaeef.pack
16M .git/objects/pack/pack-c81c5343636211ffcc9ffdfeeb3bb65b9cba75df.idx
161M    .git/objects/pack/pack-c81c5343636211ffcc9ffdfeeb3bb65b9cba75df.pack

git fetch origin
du -sh .git/objects/pack/*
18M .git/objects/pack/pack-2d2066f04670f137265fed0f382ad0d6f0dd9f3e.idx
177M    .git/objects/pack/pack-2d2066f04670f137265fed0f382ad0d6f0dd9f3e.pack
13M .git/objects/pack/pack-7be7f8792d634f63a623e50165a11983e7cdaeef.idx
113M    .git/objects/pack/pack-7be7f8792d634f63a623e50165a11983e7cdaeef.pack
22M .git/objects/pack/pack-b28e1368cf8e1ee0152e7dd7b328760c5b589c40.idx
209M    .git/objects/pack/pack-b28e1368cf8e1ee0152e7dd7b328760c5b589c40.pack
16M .git/objects/pack/pack-c81c5343636211ffcc9ffdfeeb3bb65b9cba75df.idx
161M    .git/objects/pack/pack-c81c5343636211ffcc9ffdfeeb3bb65b9cba75df.pack

After garbage collection, they are all collected into a single packfile about the same size as the one from the single fetch.

git gc
du -sh .git/objects/pack/*
22M .git/objects/pack/pack-7f1d7066fb6c5bd6a47749b215c020fab5ca416b.idx
212M    .git/objects/pack/pack-7f1d7066fb6c5bd6a47749b215c020fab5ca416b.pack
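As a quick tally of the numbers in the listings above (the .pack sizes as reported by du, index files left out), the four packs from the incremental fetches add up to roughly three times the size of the single pack left after git gc:

```shell
#!/bin/sh
# .pack sizes in M, copied from the du listings above.
separate_total=$((113 + 161 + 177 + 209))
after_gc=212
overhead=$((separate_total - after_gc))

echo "four packs before gc: ${separate_total}M"
echo "one pack after gc:    ${after_gc}M"
echo "redundant storage:    ${overhead}M"
```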