How to improve git log performance?
I'm trying to extract the git log from several repositories like this:
git log --pretty=format:%H\t%ae\t%an\t%at\t%s --numstat
For a larger repository (like rails/rails) it takes 35+ seconds to generate the log.
Is there any way to improve this performance?
My first thought was to improve your IO, but I tested against the rails repository using an SSD and got a similar result: 30 seconds.

`--numstat` is what slows everything down; otherwise `git log` can complete in under a second, even with the formatting. Doing a diff is expensive, so if you can remove it from your process, that will speed things up immensely. Perhaps do it after the fact.

Otherwise, if you filter the log entries using `git log`'s own search tools, that will reduce the number of entries that need to be diffed. For example, `git log --grep=foo --numstat` takes just one second. They're in the docs under "Commit Limiting". This can greatly reduce the number of entries git has to format. Revision ranges, date filters, author filters, log message grepping... all of these can improve `git log`'s performance on a large repository while performing an expensive operation.
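As a minimal sketch of the commit-limiting idea (built in a throwaway repository; the commit messages and file names are made up for illustration), only the commits that survive the filter get formatted and diffed:

```shell
#!/bin/sh
# Sketch: commit-limiting options reduce how many commits git log must diff.
# Uses a throwaway repo so it runs anywhere; all names are illustrative.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email dev@example.com
git config user.name dev

echo a >  app.rb; git add app.rb; git commit -qm  "feat: initial import"
echo b >> app.rb;                 git commit -qam "fix: handle nil input"
echo c >> app.rb;                 git commit -qam "docs: update readme"

# Only the commit matching --grep is formatted and diffed:
git log --grep=fix --pretty=format:'%H%x09%ae%x09%an%x09%at%x09%s' --numstat
```

The same shape works with `--since`, `--author`, or a revision range in place of `--grep`.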
You're right, it does take somewhere between 20 and 35 seconds to generate a report on 56'000 commits, producing 224'000 lines (15MiB) of output. I actually think that's pretty decent performance, but you don't; okay.

Because you're generating a report, in a constant format, from an unchanging database, you only have to do it once. Afterwards, you can use the cached result of `git log` and skip the time-consuming generation. For example:
git log --pretty=format:%H\t%ae\t%an\t%at\t%s --numstat > log-pretty.txt
You may wonder how long it takes to search the whole report for data of interest. That's a worthy question:
$ tail -1 log-pretty.txt
30 0 railties/test/webrick_dispatcher_test.rb
$ time grep railties/test/webrick_dispatcher_test.rb log-pretty.txt
…
30 0 railties/test/webrick_dispatcher_test.rb
real 0m0.012s
…
Not bad, the introduction of a "cache" has decreased the time needed from 35+ seconds to a dozen milliseconds. That's almost 3000 times as fast.
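A small sketch of keeping that cache fresh, regenerating only when HEAD has moved (the stamp-file name `log-pretty.head` and the throwaway repo are inventions for this example):

```shell
#!/bin/sh
# Sketch: regenerate the cached git log report only when HEAD has moved.
# Demonstrated in a throwaway repo; the stamp-file name is an invention.
set -e
repo=$(mktemp -d); cd "$repo"
git init -q
git config user.email dev@example.com
git config user.name dev
echo hi > f; git add f; git commit -qm "first"

cache=log-pretty.txt
stamp=log-pretty.head

refresh_cache() {
    head=$(git rev-parse HEAD)
    if [ ! -e "$cache" ] || [ "$(cat "$stamp" 2>/dev/null)" != "$head" ]; then
        git log --pretty=format:'%H%x09%ae%x09%an%x09%at%x09%s' --numstat > "$cache"
        printf '%s\n' "$head" > "$stamp"
    fi
}

refresh_cache   # generates the report once
refresh_cache   # second call is a no-op: HEAD is unchanged
grep "first" "$cache"   # later searches hit the file, not the repository
```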
TL;DR: As mentioned in GitMerge 2019:
git config --global core.commitGraph true
git config --global gc.writeCommitGraph true
cd /path/to/repo
git commit-graph write
Actually (see at the end), the first two configs are not needed with Git 2.24+ (Q3 2019): they are `true` by default.
As T4cC0re mentions:
If you are on git version 2.29 or above you should rather run:
git commit-graph write --reachable --changed-paths
This will pre-compute file paths, so that `git log` commands that are scoped to files also benefit from this cache.
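The steps above can be sketched end-to-end in a throwaway repository (`--changed-paths` needs git >= 2.29, so this falls back to a plain write on older versions; the repo contents are illustrative):

```shell
#!/bin/sh
# Sketch: build the commit-graph in a throwaway repo and confirm the file exists.
# --changed-paths needs git >= 2.29, so fall back to a plain write on older gits.
set -e
repo=$(mktemp -d); cd "$repo"
git init -q
git config user.email dev@example.com
git config user.name dev
mkdir -p deep/dir
echo x > deep/dir/file.c; git add .; git commit -qm "add file"

git commit-graph write --reachable --changed-paths 2>/dev/null ||
    git commit-graph write --reachable
git commit-graph verify                 # sanity-check what was written
ls .git/objects/info/commit-graph       # the cache that git log will consult
```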
Git 2.18 (Q2 2018) will improve `git log` performance:
See commit 902f5a2 (24 Mar 2018) by René Scharfe (rscharfe).
See commit 0aaf05b, commit 3d475f4 (22 Mar 2018) by Derrick Stolee (derrickstolee).
See commit 626fd98 (22 Mar 2018) by brian m. carlson (bk2204).
(Merged by Junio C Hamano -- gitster -- in commit 51f813c, 10 Apr 2018)
sha1_name: use bsearch_pack() for abbreviations

When computing abbreviation lengths for an object ID against a single packfile, the method find_abbrev_len_for_pack() currently implements binary search.
This is one of several implementations.
One issue with this implementation is that it ignores the fanout table in the pack-index.
Translate this binary search to use the existing bsearch_pack() method that correctly uses a fanout table.
Due to the use of the fanout table, the abbreviation computation is slightly faster than before.
For a fully-repacked copy of the Linux repo, the following 'git log' commands improved:
* git log --oneline --parents --raw
Before: 59.2s
After: 56.9s
Rel %: -3.8%
* git log --oneline --parents
Before: 6.48s
After: 5.91s
Rel %: -8.9%
The same Git 2.18 adds a commit graph: pre-compute and store in a separate file the information needed for ancestry traversal, to optimize graph walking.
See commit 7547b95, commit 3d5df01, commit 049d51a, commit 177722b, commit 4f2542b, commit 1b70dfd, commit 2a2e32b (10 Apr 2018), and commit f237c8b, commit 08fd81c, commit 4ce58ee, commit ae30d7b, commit b84f767, commit cfe8321, commit f2af9f5 (02 Apr 2018) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit b10edb2, 08 May 2018)
commit: integrate commit graph with commit parsing
Teach Git to inspect a commit graph file to supply the contents of a struct commit when calling parse_commit_gently().
This implementation satisfies all post-conditions on the struct commit, including loading parents, the root tree, and the commit date.
If core.commitGraph is false, then do not check graph files.
In test script t5318-commit-graph.sh, add output-matching conditions on read-only graph operations.
By loading commits from the graph instead of parsing commit buffers, we save a lot of time on long commit walks.
Here are some performance results for a copy of the Linux repository where 'master' has 678,653 reachable commits and is behind 'origin/master' by 59,929 commits.
| Command | Before | After | Rel % |
|----------------------------------|--------|--------|-------|
| log --oneline --topo-order -1000 | 8.31s | 0.94s | -88% |
| branch -vv | 1.02s | 0.14s | -86% |
| rev-list --all | 5.89s | 1.07s | -81% |
| rev-list --all --objects | 66.15s | 58.45s | -11% |
To learn more about commit graphs, see "How does 'git log --graph' work?".
The same Git 2.18 (Q2 2018) adds lazy loading of trees: the code has been taught to use the duplicated information stored in the commit-graph file to learn the tree object name for a commit, to avoid opening and parsing the commit object when it makes sense to do so.
See commit 279ffad (30 Apr 2018) by SZEDER Gábor (szeder).
See commit 7b8a21d, commit 2e27bd7, commit 5bb03de, commit 891435d (06 Apr 2018) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit c89b6e1, 23 May 2018)
commit-graph: lazy-load trees for commits
The commit-graph file provides quick access to commit data, including
the OID of the root tree for each commit in the graph. When performing
a deep commit-graph walk, we may not need to load most of the trees
for these commits.
Delay loading the tree object for a commit loaded from the graph until requested via get_commit_tree().
Do not lazy-load trees for commits not in the graph, since that requires duplicate parsing and the relative performance improvement when trees are not needed is small.
On the Linux repository, performance tests were run for the following
command:
git log --graph --oneline -1000
Before: 0.92s
After: 0.66s
Rel %: -28.3%
Git 2.21 (Q1 2019) adds a loose cache.
See commit 8be88db (07 Jan 2019), and commit 4cea1ce, commit d4e19e5, commit 0000d65 (06 Jan 2019) by René Scharfe (rscharfe).
(Merged by Junio C Hamano -- gitster -- in commit eb8638a, 18 Jan 2019)
object-store: use one oid_array per subdirectory for loose cache

The loose objects cache is filled one subdirectory at a time as needed.
It is stored in an oid_array, which has to be resorted after each add operation.
So when querying a wide range of objects, the partially filled array needs to be resorted up to 255 times, which takes over 100 times longer than sorting once.
Use one oid_array for each subdirectory.
This ensures that entries have to only be sorted a single time. It also avoids eight binary search steps for each cache lookup as a small bonus.
The cache is used for collision checks for the log placeholders %h, %t and %p, and we can see the change speeding them up in a repository with ca. 100 objects per subdirectory:
$ git count-objects
26733 objects, 68808 kilobytes
Test HEAD^ HEAD
--------------------------------------------------------------------
4205.1: log with %H 0.51(0.47+0.04) 0.51(0.49+0.02) +0.0%
4205.2: log with %h 0.84(0.82+0.02) 0.60(0.57+0.03) -28.6%
4205.3: log with %T 0.53(0.49+0.04) 0.52(0.48+0.03) -1.9%
4205.4: log with %t 0.84(0.80+0.04) 0.60(0.59+0.01) -28.6%
4205.5: log with %P 0.52(0.48+0.03) 0.51(0.50+0.01) -1.9%
4205.6: log with %p 0.85(0.78+0.06) 0.61(0.56+0.05) -28.2%
4205.7: log with %h-%h-%h 0.96(0.92+0.03) 0.69(0.64+0.04) -28.1%
Git 2.22 (Apr 2019) checks for errors before using data read from the commit-graph file.
See commit 93b4405, commit 43d3561, commit 7b8ce9c, commit 67a530f, commit 61df89c, commit 2ac138d (25 Mar 2019), and commit 945944c, commit f6761fa (21 Feb 2019) by Ævar Arnfjörð Bjarmason (avar).
(Merged by Junio C Hamano -- gitster -- in commit a5e4be2, 25 Apr 2019)
commit-graph write: don't die if the existing graph is corrupt

When the commit-graph is written we end up calling parse_commit(). This will in turn invoke code that'll consult the existing commit-graph about the commit, and if the graph is corrupted we die.
We thus get into a state where a failing "commit-graph verify" can't be followed-up with a "commit-graph write" if core.commitGraph=true is set; the graph either needs to be manually removed to proceed, or core.commitGraph needs to be set to "false".
Change the "commit-graph write" codepath to use a new parse_commit_no_graph() helper instead of parse_commit() to avoid this.
The latter will call repo_parse_commit_internal() with use_commit_graph=1 as seen in 177722b ("commit: integrate commit graph with commit parsing", 2018-04-10, Git v2.18.0-rc0).
Not using the old graph at all slows down the writing of the new graph by some small amount, but is a sensible way to prevent an error in the existing commit-graph from spreading.
With Git 2.24+ (Q3 2019), the commit graph is active by default:
See commit aaf633c, commit c6cc4c5, commit ad0fb65, commit 31b1de6, commit b068d9a, commit 7211b9e (13 Aug 2019) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit f4f8dfe, 09 Sep 2019)
commit-graph: turn on commit-graph by default

The commit-graph feature has seen a lot of activity in the past year or so since it was introduced.
The feature is a critical performance enhancement for medium- to large-sized repos, and does not significantly hurt small repos.
Change the defaults for core.commitGraph and gc.writeCommitGraph to true so users benefit from this feature by default.
Still with Git 2.24 (Q4 2019), a configuration variable tells "git fetch" to write a commit-graph after finishing.
See commit 50f26bd (03 Sep 2019) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit 5a53509, 30 Sep 2019)
fetch: add fetch.writeCommitGraph config setting
The commit-graph feature is now on by default, and is being written during 'git gc' by default.
Typically, Git only writes a commit-graph when a 'git gc --auto' command passes the gc.auto setting to actually do work. This means that a commit-graph will typically fall behind the commits that are being used every day.
To stay updated with the latest commits, add a step to 'git fetch' to write a commit-graph after fetching new objects.
The fetch.writeCommitGraph config setting enables writing a split commit-graph, so on average the cost of writing this file is very small. Occasionally, the commit-graph chain will collapse to a single level, and this could be slow for very large repos.
For additional use, adjust the default to be true when feature.experimental is enabled.
And still with Git 2.24 (Q4 2019), the commit-graph is more robust.
See commit 6abada1, commit fbab552 (12 Sep 2019) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit 098e8c6, 07 Oct 2019)
commit-graph: bump DIE_ON_LOAD check to actual load-time

Commit 43d3561 (commit-graph write: don't die if the existing graph is corrupt, 2019-03-25, Git v2.22.0-rc0) added an environment variable we use only in the test suite, $GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD.
But it put the check for this variable at the very top of prepare_commit_graph(), which is called every time we want to use the commit graph.
Most importantly, it comes before we check the fast-path "did we already try to load?", meaning we end up calling getenv() for every single use of the commit graph, rather than just when we load.
getenv() is allowed to have unexpected side effects, but that shouldn't be a problem here; we're lazy-loading the graph so it's clear that at least one invocation of this function is going to call it.
But it is inefficient. getenv() typically has to do a linear search through the environment space.
We could memoize the call, but it's simpler still to just bump the check down to the actual loading step. That's fine for our sole user in t5318, and produces this minor real-world speedup:
[before]
Benchmark #1: git -C linux rev-list HEAD >/dev/null
Time (mean ± σ): 1.460 s ± 0.017 s [User: 1.174 s, System: 0.285 s]
Range (min … max): 1.440 s … 1.491 s 10 runs
[after]
Benchmark #1: git -C linux rev-list HEAD >/dev/null
Time (mean ± σ): 1.391 s ± 0.005 s [User: 1.118 s, System: 0.273 s]
Range (min … max): 1.385 s … 1.399 s 10 runs
Git 2.24 (Q4 2019) also includes a regression fix.
See commit cb99a34, commit e88aab9 (24 Oct 2019) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit dac1d83, 04 Nov 2019)
commit-graph: fix writing first commit-graph during fetch
Reported-by: Johannes Schindelin
Helped-by: Jeff King
Helped-by: Szeder Gábor
Signed-off-by: Derrick Stolee
The previous commit includes a failing test for an issue around fetch.writeCommitGraph and fetching in a repo with a submodule. Here, we fix that bug and set the test to "test_expect_success".
The problem arises with this set of commands when the remote repo at <url> has a submodule. Note that --recurse-submodules is not needed to demonstrate the bug.
$ git clone <url> test
$ cd test
$ git -c fetch.writeCommitGraph=true fetch origin
Computing commit graph generation numbers: 100% (12/12), done.
BUG: commit-graph.c:886: missing parent <hash1> for commit <hash2>
Aborted (core dumped)
As an initial fix, I converted the code in builtin/fetch.c that calls write_commit_graph_reachable() to instead launch a "git commit-graph write --reachable --split" process. That code worked, but is not how we want the feature to work long-term.
That test did demonstrate that the issue must be something to do with internal state of the 'git fetch' process.
The write_commit_graph() method in commit-graph.c ensures the commits we plan to write are "closed under reachability" using close_reachable().
This method walks from the input commits, and uses the UNINTERESTING flag to mark which commits have already been visited. This allows the walk to take O(N) time, where N is the number of commits, instead of O(P) time, where P is the number of paths. (The number of paths can be exponential in the number of commits.)
However, the UNINTERESTING flag is used in lots of places in the codebase. This flag usually means some barrier to stop a commit walk, such as in revision-walking to compare histories.
It is not often cleared after the walk completes because the starting points of those walks do not have the UNINTERESTING flag, and clear_commit_marks() would stop immediately.
This is happening during a 'git fetch' call with a remote. The fetch negotiation is comparing the remote refs with the local refs and marking some commits as UNINTERESTING.
I tested running clear_commit_marks_many() to clear the UNINTERESTING flag inside close_reachable(), but the tips did not have the flag, so that did nothing.
It turns out that the calculate_changed_submodule_paths() method is at fault. Thanks, Peff, for pointing out this detail! More specifically, for each submodule, the collect_changed_submodules() runs a revision walk to essentially do file-history on the list of submodules. That revision walk marks commits UNINTERESTING if they are simplified away by not changing the submodule.
Instead, I finally arrived on the conclusion that I should use a flag that is not used in any other part of the code. In commit-reach.c, a number of flags were defined for commit walk algorithms. The REACHABLE flag seemed like it made the most sense, and it seems it was not actually used in the file.
The REACHABLE flag was used in early versions of commit-reach.c, but was removed by 4fbcca4 ("commit-reach: make can_all_from_reach ... linear", 2018-07-20, v2.20.0-rc0).
Add the REACHABLE flag to commit-graph.c and use it instead of UNINTERESTING in close_reachable().
This fixes the bug in manual testing.
Fetching in parallel from multiple remotes into the same repository interacted badly with the recent change to (optionally) update the commit graph after a fetch job finishes, as those parallel fetches compete with each other.
That has been corrected with Git 2.25 (Q1 2020).
See commit 7d8e72b, commit c14e6e7 (03 Nov 2019) by Johannes Schindelin (dscho).
(Merged by Junio C Hamano -- gitster -- in commit bcb06e2, 01 Dec 2019)
fetch: add the command-line option --write-commit-graph
Signed-off-by: Johannes Schindelin
This option overrides the config setting fetch.writeCommitGraph, if both are set.
And:
fetch: avoid locking issues between fetch.jobs/fetch.writeCommitGraph
Signed-off-by: Johannes Schindelin
When both fetch.jobs and fetch.writeCommitGraph are set, we currently try to write the commit graph in each of the concurrent fetch jobs, which frequently leads to error messages like this one:
fatal: Unable to create '.../.git/objects/info/commit-graphs/commit-graph-chain.lock': File exists.
Let's avoid this by holding off from writing the commit graph until all fetch jobs are done.
The code that writes split commit-graph files upon fetching, which computed bogus values for the parameters used to split the resulting files, has been corrected with Git 2.25 (Q1 2020).
See commit 63020f1 (02 Jan 2020) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit 037f067, 06 Jan 2020)
commit-graph: prefer default size_mult when given zero
Signed-off-by: Derrick Stolee
In 50f26bd ("fetch: add fetch.writeCommitGraph config setting", 2019-09-02, Git v2.24.0-rc0 -- merge listed in batch #4), the fetch builtin added the capability to write a commit-graph using the "--split" feature.
This feature creates multiple commit-graph files, and those can merge based on a set of "split options" including a size multiple.
The default size multiple is 2, which intends to provide a log_2 N depth of the commit-graph chain where N is the number of commits.
However, I noticed during dogfooding that my commit-graph chains were becoming quite large when left only to builds by 'git fetch'.
It turns out that in split_graph_merge_strategy(), we default the size_mult variable to 2, except we override it with the context's split_opts if they exist.
In builtin/fetch.c, we create such a split_opts, but do not populate it with values.
This problem is due to two failures:

- It is unclear that we can add the flag COMMIT_GRAPH_WRITE_SPLIT with a NULL split_opts.
- If we have a non-NULL split_opts, then we override the default values even if a zero value is given.

Correct both of these issues:

- First, do not override size_mult when the options provide a zero value.
- Second, stop creating a split_opts in the fetch builtin.
Note that git log was broken between Git 2.22 (May 2019) and Git 2.27 (Q2 2020) when using a magic pathspec:
the command-line parsing of "git log :/a/b/" was broken for about a full year without anybody noticing, which has now been corrected.
See commit 0220461 (10 Apr 2020) by Jeff King (peff).
See commit 5ff4b92 (10 Apr 2020) by Junio C Hamano (gitster).
(Merged by Junio C Hamano -- gitster -- in commit 95ca489, 22 Apr 2020)
sha1-name: do not assume that the ref store is initialized
Reported-by: Érico Rolim
c931ba4e ("sha1-name.c: remove the_repo from handle_one_ref()", 2019-04-16, Git v2.22.0-rc0 -- merge listed in batch #8) replaced the use of the for_each_ref() helper, which works with the main ref store of the default repository instance, with refs_for_each_ref(), which can work on any ref store instance, by assuming that the repository instance the function is given has its ref store already initialized.
But it is possible that nobody has initialized it, in which case the code ends up dereferencing a NULL pointer.
And:
repository: mark the "refs" pointer as private
Signed-off-by: Jeff King
The "refs" pointer in a struct repository starts life as NULL, but is then lazily initialized when it is accessed via get_main_ref_store().
However, it's easy for calling code to forget this and access it directly, leading to code which works some of the time, but fails if it is called before anybody else accesses the refs.
This was the cause of the bug fixed by 5ff4b920eb ("sha1-name: do not assume that the ref store is initialized", 2020-04-09, Git v2.27.0 -- merge listed in batch #3). In order to prevent similar bugs, let's more clearly mark the "refs" field as private.
There is another avenue for improving git log performance, and it builds on the commit graph mentioned above.
Git 2.27 (Q2 2020) introduces a commit-graph extension to efficiently check the paths modified by each commit, using Bloom filters.
See commit caf388c (09 Apr 2020), and commit e369698 (30 Mar 2020) by Derrick Stolee (derrickstolee).
See commit d5b873c, commit a759bfa, commit 42e50e7, commit a56b946, commit d38e07b, commit 1217c03, commit 76ffbca (06 Apr 2020), and commit 3d11275, commit f97b932, commit ed591fe, commit f1294ea, commit f52207a, commit 3be7efc (30 Mar 2020) by Garima Singh (singhgarima).
See commit d21ee7d (30 Mar 2020) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit 9b6606f, 01 May 2020)
revision.c: use Bloom filters to speed up path based revision walks
Helped-by: Derrick Stolee <dstolee@microsoft.com>
Helped-by: SZEDER Gábor
Helped-by: Jonathan Tan
Signed-off-by: Garima Singh
Revision walk will now use Bloom filters for commits to speed up revision walks for a particular path (for computing history for that path), if they are present in the commit-graph file.
We load the Bloom filters during the prepare_revision_walk step, currently only when dealing with a single pathspec.
Extending it to work with multiple pathspecs can be explored and built on top of this series in the future.
While comparing trees in rev_compare_trees(), if the Bloom filter says that the file is not different between the two trees, we don't need to compute the expensive diff.
This is where we get our performance gains.
The other response of the Bloom filter is ':maybe', in which case we fall back to the full diff calculation to determine if the path was changed in the commit.
We do not try to use Bloom filters when the '--walk-reflogs' option is specified.
The '--walk-reflogs' option does not walk the commit ancestry chain like the rest of the options.
Incorporating the performance gains when walking reflog entries would add more complexity, and can be explored in a later series.
Performance Gains: We tested the performance of git log -- <path> on the git repo, the linux repo and some internal large repos, with a variety of paths of varying depths.
On the git and linux repos:
- we observed a 2x to 5x speed up.
On a large internal repo with files seated 6-10 levels deep in the tree:
- we observed 10x to 20x speed ups, with some paths going up to 28 times faster.
But: fix (with Git 2.27, Q2 2020) a bug discovered by a fuzzer.
See commit fbda77c (04 May 2020) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit 95875e0, 08 May 2020)
commit-graph: avoid memory leaks
Signed-off-by: Jonathan Tan
Reviewed-by: Derrick Stolee
A fuzzer running on the entry point provided by fuzz-commit-graph.c revealed a memory leak when parse_commit_graph() creates a struct bloom_filter_settings and then returns early due to error.
Fix that error by always freeing that struct first (if it exists) before returning early due to error.
While making that change, I also noticed another possible memory leak: when the BLOOMDATA chunk is provided but not BLOOMINDEXES.
Also fix that error.
Git 2.27 (Q2 2020) improves the Bloom filters again:
See commit b928e48 (11 May 2020) by SZEDER Gábor (szeder).
See commit 2f6775f, commit 65c1a28, commit 8809328, commit 891c17c (11 May 2020), and commit 54c337b, commit eb591e4 (01 May 2020) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit 4b1e5e5, 14 May 2020)
bloom: de-duplicate directory entries
Signed-off-by: Derrick Stolee
When computing a changed-path Bloom filter, we need to take the files that changed from the diff computation and extract the parent directories. That way, a directory pathspec such as "Documentation" could match commits that change "Documentation/git.txt".
However, the current code does a poor job of this process.
The paths are added to a hashmap, but we do not check if an entry already exists with that path.
This can create many duplicate entries and cause the filter to have a much larger length than it should.
This means that the filter is more sparse than intended, which helps the false positive rate, but wastes a lot of space.
Properly use hashmap_get() before hashmap_add(). Also be sure to include a comparison function so these can be matched correctly.
This has an effect on a test in t0095-bloom.sh.
This makes sense: there are ten changes inside "smallDir", so the total number of paths in the filter should be 11.
This would result in 11 * 10 bits required, and with 8 bits per byte, this results in 14 bytes.
With Git 2.28 (Q3 2020), "git log -L..." now takes advantage of the "which paths are touched by this commit?" information stored in the commit-graph system.
For this, the Bloom filters are used.
See commit f32dde8 (11 May 2020) by Derrick Stolee (derrickstolee).
See commit 002933f, commit 3cb9d2b, commit 48da94b, commit d554672 (11 May 2020) by SZEDER Gábor (szeder).
(Merged by Junio C Hamano -- gitster -- in commit c3a0282, 09 Jun 2020)
line-log: integrate with changed-path Bloom filters
Signed-off-by: Derrick Stolee
The previous changes to the line-log machinery focused on making the first result appear faster. This was achieved by no longer walking the entire commit history before returning the early results.
There is still another way to improve the performance: walk most commits much faster. Let's use the changed-path Bloom filters to reduce time spent computing diffs.
Since the line-log computation requires opening blobs and checking the content-diff, there is still a lot of necessary computation that cannot be replaced with changed-path Bloom filters.
The part that we can reduce is most effective when checking the history of a file that is deep in several directories and those directories are modified frequently.
In this case, the computation to check if a commit is TREESAME to its first parent takes a large fraction of the time.
That is ripe for improvement with changed-path Bloom filters.
We must ensure that prepare_to_use_bloom_filters() is called in revision.c so that the bloom_filter_settings are loaded into the struct rev_info from the commit-graph.
Of course, some cases are still forbidden, but in the line-log case the pathspec is provided in a different way than normal.
Since multiple paths and segments could be requested, we compute the struct bloom_key data dynamically during the commit walk. This could likely be improved, but adds code complexity that is not valuable at this time.
There are two cases to care about: merge commits and "ordinary" commits.
- Merge commits have multiple parents, but if we are TREESAME to our first parent in every range, then pass the blame for all ranges to the first parent.
- Ordinary commits have the same condition, but each is done slightly differently in the process_ranges_[merge|ordinary]_commit() methods.
By checking if the changed-path Bloom filter can guarantee TREESAME, we can avoid that tree-diff cost. If the filter says "probably changed", then we need to run the tree-diff and then the blob-diff if there was a real edit.
The Linux kernel repository is a good testing ground for the performance improvements claimed here.
There are two different cases to test:
- The first is the "entire history" case, where we output the entire history to /dev/null to see how long it would take to compute the full line-log history.
- The second is the "first result" case, where we find how long it takes to show the first value, which is an indicator of how quickly a user would see responses when waiting at a terminal.
To test, I selected the paths that were changed most frequently in the top 10,000 commits using this command (stolen from Stack Overflow):
git log --pretty=format: --name-only -n 10000 | sort | \
uniq -c | sort -rg | head -10
which results in
121 MAINTAINERS
63 fs/namei.c
60 arch/x86/kvm/cpuid.c
59 fs/io_uring.c
58 arch/x86/kvm/vmx/vmx.c
51 arch/x86/kvm/x86.c
45 arch/x86/kvm/svm.c
42 fs/btrfs/disk-io.c
42 Documentation/scsi/index.rst
(along with a bogus first result). It appears that the path arch/x86/kvm/svm.c was renamed, so we ignore that entry. This leaves the following results for the real command time:
| | Entire History | First Result |
| Path | Before | After | Before | After |
|------------------------------|--------|--------|--------|--------|
| MAINTAINERS | 4.26 s | 3.87 s | 0.41 s | 0.39 s |
| fs/namei.c | 1.99 s | 0.99 s | 0.42 s | 0.21 s |
| arch/x86/kvm/cpuid.c | 5.28 s | 1.12 s | 0.16 s | 0.09 s |
| fs/io_uring.c | 4.34 s | 0.99 s | 0.94 s | 0.27 s |
| arch/x86/kvm/vmx/vmx.c | 5.01 s | 1.34 s | 0.21 s | 0.12 s |
| arch/x86/kvm/x86.c | 2.24 s | 1.18 s | 0.21 s | 0.14 s |
| fs/btrfs/disk-io.c | 1.82 s | 1.01 s | 0.06 s | 0.05 s |
| Documentation/scsi/index.rst | 3.30 s | 0.89 s | 1.46 s | 0.03 s |
It is worth noting that the least speedup comes for the MAINTAINERS file which is:
- edited frequently,
- low in the directory hierarchy, and
- quite a large file.
All of those points lead to spending more time doing the blob diff and less time doing the tree diff.
Still, we see some improvement in that case and significant improvement in other cases.
A 2-4x speedup is likely the more typical case as opposed to the small 5% change for that file.
In Git 2.29 (Q4 2020), the changed-path Bloom filters were improved using ideas from an independent implementation.
See commit 7fbfe07, commit bb4d60e, commit 5cfa438, commit 2ad4f1a, commit fa79653, commit 0ee3cb8, commit 1df15f8, commit 6141cdf, commit cb9daf1, commit 35a9f1e (05 Jun 2020) by SZEDER Gábor (szeder).
(Merged by Junio C Hamano -- gitster -- in commit de6dda0, 30 Jul 2020)
commit-graph: simplify parse_commit_graph() #1
Signed-off-by: SZEDER Gábor
Signed-off-by: Derrick Stolee
While we iterate over all entries of the Chunk Lookup table we make sure that we don't attempt to read past the end of the mmap-ed commit-graph file, and check in each iteration that the chunk ID and offset we are about to read is still within the mmap-ed memory region. However, these checks in each iteration are not really necessary, because the number of chunks in the commit-graph file is already known before this loop from the just parsed commit-graph header.
So let's check that the commit-graph file is large enough for all entries in the Chunk Lookup table before we start iterating over those entries, and drop those per-iteration checks.
While at it, take into account the size of everything that is necessary to have a valid commit-graph file, i.e. the size of the header, the size of the mandatory OID Fanout chunk, and the size of the signature in the trailer as well.
Note that this necessitates the change of the error message as well.
The Chunk Lookup table stores the chunks' starting offset in the commit-graph file, not their sizes.
Consequently, the size of a chunk can only be calculated by subtracting its offset from the offset of the subsequent chunk (or that of the terminating label).
This is currently implemented in a bit complicated way: as we iterate over the entries of the Chunk Lookup table, we check the id of each chunk and store its starting offset, then we check the id of the last seen chunk and calculate its size using its previously saved offset.
At the moment there is only one chunk for which we calculate its size, but this patch series will add more, and the repeated chunk id checks are not that pretty.
Instead let's read ahead the offset of the next chunk on each iteration, so we can calculate the size of each chunk right away, right where we store its starting offset.
我正在尝试从这样的几个存储库中提取 git 日志:
git log --pretty=format:%H\t%ae\t%an\t%at\t%s --numstat
对于较大的存储库(如 rails/rails),生成日志需要 35 秒以上的时间。
有没有办法提高这种性能?
我的第一个想法是改善您的 IO,但我使用 SSD 针对 rails 存储库进行了测试,得到了类似的结果:30 秒。
--numstat
是导致一切变慢的原因,否则 git-log
即使格式化也可以在 1 秒内完成。做一个差异是昂贵的,所以如果你能从你的过程中删除它,那将极大地加快速度。也许事后再做。
否则,如果您使用 git-log
自己的搜索工具过滤日志条目,这将减少需要进行比较的条目数量。例如,git log --grep=foo --numstat
只需一秒钟。They're in the docs under "Commit Limiting"。这可以大大减少 git 必须格式化的条目数。修订范围、日期过滤器、作者过滤器、日志消息 grepping...所有这些都可以提高 git-log
在大型存储库上的性能,同时执行昂贵的操作。
你是对的,它确实需要 20 到 35 秒才能生成关于 56'000 次提交的报告,生成 224'000 行 (15MiB) 的输出。我实际上认为这是相当不错的表现,但你不认为;好的
因为您是从不变的数据库中使用固定格式生成报告,所以您只需执行一次。之后就可以使用git log
的缓存结果,跳过耗时生成。例如:
git log --pretty=format:%H\t%ae\t%an\t%at\t%s --numstat > log-pretty.txt
您可能想知道在整个报告中搜索感兴趣的数据需要多长时间。这是一个有价值的问题:
$ tail -1 log-pretty.txt
30 0 railties/test/webrick_dispatcher_test.rb
$ time grep railties/test/webrick_dispatcher_test.rb log-pretty.txt
…
30 0 railties/test/webrick_dispatcher_test.rb
real 0m0.012s
…
不错,"cache" 的引入将所需时间从 35+ 秒减少到十几毫秒。这几乎快了 3000 倍。
TLDR;作为 mentioned in GitMerge 2019:
git config --global core.commitGraph true
git config --global gc.writeCommitGraph true
cd /path/to/repo
git commit-graph write
实际上(见最后),Git 2.24+(2019 年第 3 季度)不需要前两个配置:默认情况下它们是 true
。
如T4cC0re mentions in
If you are on git version 2.29 or above you should rather run:
git commit-graph write --reachable --changed-paths
This will pre-compute file paths, so that
git log
commands that are scoped to files also benefit from this cache.
Git 2.18(2018 年第 2 季度)将提高 git log
性能:
参见 commit 902f5a2 (24 Mar 2018) by René Scharfe (rscharfe
)。
参见 commit 0aaf05b, commit 3d475f4 (22 Mar 2018) by Derrick Stolee (derrickstolee
)。
参见 commit 626fd98 (22 Mar 2018) by brian m. carlson (bk2204
)。
(由 Junio C Hamano -- gitster
-- in commit 51f813c 合并,2018 年 4 月 10 日)
sha1_name
: usebsearch_pack()
for abbreviations
When computing abbreviation lengths for an object ID against a single packfile, the method find_abbrev_len_for_pack() currently implements binary search.
This is one of several implementations.
One issue with this implementation is that it ignores the fanout table in the pack-index.
Translate this binary search to use the existing bsearch_pack() method that correctly uses a fanout table.
Due to the use of the fanout table, the abbreviation computation is slightly faster than before.
For a fully-repacked copy of the Linux repo, the following 'git log' commands improved:

* git log --oneline --parents --raw
  Before: 59.2s
  After: 56.9s
  Rel %: -3.8%
* git log --oneline --parents
  Before: 6.48s
  After: 5.91s
  Rel %: -8.9%
The same Git 2.18 adds a commit graph: precompute and store the information needed for ancestry traversal in a separate file, to optimize graph walking.
See commit 7547b95, commit 3d5df01, commit 049d51a, commit 177722b, commit 4f2542b, commit 1b70dfd, commit 2a2e32b (10 Apr 2018), and commit f237c8b, commit 08fd81c, commit 4ce58ee, commit ae30d7b, commit b84f767, commit cfe8321, commit f2af9f5 (02 Apr 2018) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit b10edb2, 08 May 2018)
commit: integrate commit graph with commit parsing
Teach Git to inspect a commit graph file to supply the contents of a struct commit when calling parse_commit_gently().
This implementation satisfies all post-conditions on the struct commit, including loading parents, the root tree, and the commit date.
If core.commitGraph is false, then do not check graph files.
In test script t5318-commit-graph.sh, add output-matching conditions on read-only graph operations.
By loading commits from the graph instead of parsing commit buffers, we save a lot of time on long commit walks.
Here are some performance results for a copy of the Linux repository where 'master' has 678,653 reachable commits and is behind 'origin/master' by 59,929 commits.

| Command                          | Before | After  | Rel % |
|----------------------------------|--------|--------|-------|
| log --oneline --topo-order -1000 | 8.31s  | 0.94s  | -88%  |
| branch -vv                       | 1.02s  | 0.14s  | -86%  |
| rev-list --all                   | 5.89s  | 1.07s  | -81%  |
| rev-list --all --objects         | 66.15s | 58.45s | -11%  |
To learn more about commit graphs, see "How does 'git log --graph' work?".
The same Git 2.18 (Q2 2018) added lazy loading of trees:
The codebase has been taught to use the duplicated information stored in the commit-graph file to learn the tree object name for a commit, to avoid opening and parsing the commit object when it makes sense to do so.
See commit 279ffad (30 Apr 2018) by SZEDER Gábor (szeder).
See commit 7b8a21d, commit 2e27bd7, commit 5bb03de, commit 891435d (06 Apr 2018) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit c89b6e1, 23 May 2018)
commit-graph: lazy-load trees for commits
The commit-graph file provides quick access to commit data, including the OID of the root tree for each commit in the graph. When performing a deep commit-graph walk, we may not need to load most of the trees for these commits.
Delay loading the tree object for a commit loaded from the graph until requested via get_commit_tree().
Do not lazy-load trees for commits not in the graph, since that requires duplicate parsing, and the relative performance improvement when trees are not needed is small.
On the Linux repository, performance tests were run for the following command:

git log --graph --oneline -1000
  Before: 0.92s
  After: 0.66s
  Rel %: -28.3%
Git 2.21 (Q1 2019) adds a loose object cache.
See commit 8be88db (07 Jan 2019), and commit 4cea1ce, commit d4e19e5, commit 0000d65 (06 Jan 2019) by René Scharfe (rscharfe).
(Merged by Junio C Hamano -- gitster -- in commit eb8638a, 18 Jan 2019)
object-store: use one oid_array per subdirectory for loose cache
The loose objects cache is filled one subdirectory at a time as needed.
It is stored in an oid_array, which has to be resorted after each add operation.
So when querying a wide range of objects, the partially filled array needs to be resorted up to 255 times, which takes over 100 times longer than sorting once.
Use one oid_array for each subdirectory.
This ensures that entries have to only be sorted a single time. It also avoids eight binary search steps for each cache lookup as a small bonus.
The cache is used for collision checks for the log placeholders %h, %t and %p, and we can see the change speeding them up in a repository with ca. 100 objects per subdirectory:

$ git count-objects
26733 objects, 68808 kilobytes

Test                      HEAD^             HEAD
--------------------------------------------------------------------
4205.1: log with %H       0.51(0.47+0.04)   0.51(0.49+0.02) +0.0%
4205.2: log with %h       0.84(0.82+0.02)   0.60(0.57+0.03) -28.6%
4205.3: log with %T       0.53(0.49+0.04)   0.52(0.48+0.03) -1.9%
4205.4: log with %t       0.84(0.80+0.04)   0.60(0.59+0.01) -28.6%
4205.5: log with %P       0.52(0.48+0.03)   0.51(0.50+0.01) -1.9%
4205.6: log with %p       0.85(0.78+0.06)   0.61(0.56+0.05) -28.2%
4205.7: log with %h-%h-%h 0.96(0.92+0.03)   0.69(0.64+0.04) -28.1%
Git 2.22 (Apr 2019) checks for errors before using data read from the commit-graph file.
See commit 93b4405, commit 43d3561, commit 7b8ce9c, commit 67a530f, commit 61df89c, commit 2ac138d (25 Mar 2019), and commit 945944c, commit f6761fa (21 Feb 2019) by Ævar Arnfjörð Bjarmason (avar).
(Merged by Junio C Hamano -- gitster -- in commit a5e4be2, 25 Apr 2019)
commit-graph write: don't die if the existing graph is corrupt
When the commit-graph is written we end up calling parse_commit(). This will in turn invoke code that'll consult the existing commit-graph about the commit; if the graph is corrupted, we die.
We thus get into a state where a failing "commit-graph verify" can't be followed up with a "commit-graph write" if core.commitGraph=true is set; the graph either needs to be manually removed to proceed, or core.commitGraph needs to be set to "false".
Change the "commit-graph write" codepath to use a new parse_commit_no_graph() helper instead of parse_commit() to avoid this.
The latter would call repo_parse_commit_internal() with use_commit_graph=1, as seen in 177722b ("commit: integrate commit graph with commit parsing", 2018-04-10, Git v2.18.0-rc0).
Not using the old graph at all slows down the writing of the new graph by some small amount, but is a sensible way to prevent an error in the existing commit-graph from spreading.
With Git 2.24+ (Q4 2019), the commit-graph is active by default:
See commit aaf633c, commit c6cc4c5, commit ad0fb65, commit 31b1de6, commit b068d9a, commit 7211b9e (13 Aug 2019) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit f4f8dfe, 09 Sep 2019)
commit-graph: turn on commit-graph by default
The commit-graph feature has seen a lot of activity in the past year or so since it was introduced.
The feature is a critical performance enhancement for medium- to large-sized repos, and does not significantly hurt small repos.
Change the defaults for core.commitGraph and gc.writeCommitGraph to true so users benefit from this feature by default.
Still with Git 2.24 (Q4 2019), a configuration variable tells "git fetch" to write the commit graph after finishing:
See commit 50f26bd (03 Sep 2019) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit 5a53509, 30 Sep 2019)
fetch: add fetch.writeCommitGraph config setting
The commit-graph feature is now on by default, and is being written during 'git gc' by default.
Typically, Git only writes a commit-graph when a 'git gc --auto' command passes the gc.auto setting to actually do work. This means that a commit-graph will typically fall behind the commits that are being used every day.
To stay updated with the latest commits, add a step to 'git fetch' to write a commit-graph after fetching new objects.
The fetch.writeCommitGraph config setting enables writing a split commit-graph, so on average the cost of writing this file is very small. Occasionally, the commit-graph chain will collapse to a single level, and this could be slow for very large repos.
For additional use, adjust the default to be true when feature.experimental is enabled.
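A sketch of opting in today (the config name is the one from the quoted commit; shown in a throwaway repo, but in practice you'd run it inside your clone):

```shell
# Sketch: make every `git fetch` keep the split commit-graph chain
# current, instead of waiting for the next `git gc`.
cd "$(mktemp -d)"
git init -q
git config fetch.writeCommitGraph true

# After a fetch, the chain is stored under
# .git/objects/info/commit-graphs/ (per the commit-graph docs).
git config fetch.writeCommitGraph
```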
And still with Git 2.24 (Q4 2019), the commit-graph is more robust.
See commit 6abada1, commit fbab552 (12 Sep 2019) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit 098e8c6, 07 Oct 2019)
commit-graph: bump DIE_ON_LOAD check to actual load-time
Commit 43d3561 ("commit-graph write: don't die if the existing graph is corrupt", 2019-03-25, Git v2.22.0-rc0) added an environment variable we use only in the test suite, $GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD.
But it put the check for this variable at the very top of prepare_commit_graph(), which is called every time we want to use the commit graph.
Most importantly, it comes before we check the fast-path "did we already try to load?", meaning we end up calling getenv() for every single use of the commit graph, rather than just when we load.
getenv() is allowed to have unexpected side effects, but that shouldn't be a problem here; we're lazy-loading the graph, so it's clear that at least one invocation of this function is going to call it.
But it is inefficient. getenv() typically has to do a linear search through the environment space.
We could memoize the call, but it's simpler still to just bump the check down to the actual loading step. That's fine for our sole user in t5318, and produces this minor real-world speedup:

[before]
Benchmark #1: git -C linux rev-list HEAD >/dev/null
  Time (mean ± σ): 1.460 s ± 0.017 s [User: 1.174 s, System: 0.285 s]
  Range (min … max): 1.440 s … 1.491 s    10 runs

[after]
Benchmark #1: git -C linux rev-list HEAD >/dev/null
  Time (mean ± σ): 1.391 s ± 0.005 s [User: 1.118 s, System: 0.273 s]
  Range (min … max): 1.385 s … 1.399 s    10 runs
Git 2.24 (Q4 2019) also includes a regression fix.
See commit cb99a34, commit e88aab9 (24 Oct 2019) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit dac1d83, 04 Nov 2019)
commit-graph: fix writing first commit-graph during fetch
Reported-by: Johannes Schindelin
Helped-by: Jeff King
Helped-by: Szeder Gábor
Signed-off-by: Derrick Stolee
The previous commit includes a failing test for an issue around fetch.writeCommitGraph and fetching in a repo with a submodule. Here, we fix that bug and set the test to "test_expect_success".
The problem arises with this set of commands when the remote repo at <url> has a submodule.
Note that --recurse-submodules is not needed to demonstrate the bug.

$ git clone <url> test
$ cd test
$ git -c fetch.writeCommitGraph=true fetch origin
Computing commit graph generation numbers: 100% (12/12), done.
BUG: commit-graph.c:886: missing parent <hash1> for commit <hash2>
Aborted (core dumped)
As an initial fix, I converted the code in builtin/fetch.c that calls write_commit_graph_reachable() to instead launch a "git commit-graph write --reachable --split" process. That code worked, but is not how we want the feature to work long-term.
That test did demonstrate that the issue must be something to do with internal state of the 'git fetch' process.
The write_commit_graph() method in commit-graph.c ensures the commits we plan to write are "closed under reachability" using close_reachable().
This method walks from the input commits, and uses the UNINTERESTING flag to mark which commits have already been visited. This allows the walk to take O(N) time, where N is the number of commits, instead of O(P) time, where P is the number of paths. (The number of paths can be exponential in the number of commits.)
However, the UNINTERESTING flag is used in lots of places in the codebase. This flag usually means some barrier to stop a commit walk, such as in revision-walking to compare histories.
It is not often cleared after the walk completes because the starting points of those walks do not have the UNINTERESTING flag, and clear_commit_marks() would stop immediately.
This is happening during a 'git fetch' call with a remote. The fetch negotiation is comparing the remote refs with the local refs and marking some commits as UNINTERESTING.
I tested running clear_commit_marks_many() to clear the UNINTERESTING flag inside close_reachable(), but the tips did not have the flag, so that did nothing.
It turns out that the calculate_changed_submodule_paths() method is at fault. Thanks, Peff, for pointing out this detail! More specifically, for each submodule, collect_changed_submodules() runs a revision walk to essentially do file-history on the list of submodules. That revision walk marks commits UNINTERESTING if they are simplified away by not changing the submodule.
Instead, I finally arrived on the conclusion that I should use a flag that is not used in any other part of the code. In commit-reach.c, a number of flags were defined for commit walk algorithms. The REACHABLE flag seemed like it made the most sense, and it seems it was not actually used in the file.
The REACHABLE flag was used in early versions of commit-reach.c, but was removed by 4fbcca4 ("commit-reach: make can_all_from_reach ... linear", 2018-07-20, v2.20.0-rc0).
Add the REACHABLE flag to commit-graph.c and use it instead of UNINTERESTING in close_reachable().
This fixes the bug in manual testing.
Fetching in parallel from multiple remotes into the same repository interacted badly with the recent change to (optionally) update the commit-graph after a fetch job completes, because those parallel fetches compete with each other.
This has been corrected with Git 2.25 (Q1 2020).
See commit 7d8e72b, commit c14e6e7 (03 Nov 2019) by Johannes Schindelin (dscho).
(Merged by Junio C Hamano -- gitster -- in commit bcb06e2, 01 Dec 2019)
fetch: add the command-line option --write-commit-graph
Signed-off-by: Johannes Schindelin
This option overrides the config setting fetch.writeCommitGraph, if both are set.
And:
fetch: avoid locking issues between fetch.jobs/fetch.writeCommitGraph
Signed-off-by: Johannes Schindelin
When both fetch.jobs and fetch.writeCommitGraph are set, we currently try to write the commit graph in each of the concurrent fetch jobs, which frequently leads to error messages like this one:

fatal: Unable to create '.../.git/objects/info/commit-graphs/commit-graph-chain.lock': File exists.
Let's avoid this by holding off from writing the commit graph until all fetch jobs are done.
The code that writes the split commit-graph files on fetch computed a bogus value for the parameter used to split the resulting files; this has been corrected with Git 2.25 (Q1 2020).
See commit 63020f1 (02 Jan 2020) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit 037f067, 06 Jan 2020)
commit-graph: prefer default size_mult when given zero
Signed-off-by: Derrick Stolee
In 50f26bd ("fetch: add fetch.writeCommitGraph config setting", 2019-09-02, Git v2.24.0-rc0 -- merge listed in batch #4), the fetch builtin added the capability to write a commit-graph using the "--split" feature.
This feature creates multiple commit-graph files, and those can merge based on a set of "split options" including a size multiple.
The default size multiple is 2, which intends to provide a log_2 N depth of the commit-graph chain, where N is the number of commits.
However, I noticed during dogfooding that my commit-graph chains were becoming quite large when left only to builds by 'git fetch'.
It turns out that in split_graph_merge_strategy(), we default the size_mult variable to 2, except we override it with the context's split_opts if they exist.
In builtin/fetch.c, we create such a split_opts, but do not populate it with values.
This problem is due to two failures:
- It is unclear that we can add the flag COMMIT_GRAPH_WRITE_SPLIT with a NULL split_opts.
- If we have a non-NULL split_opts, then we override the default values even if a zero value is given.

Correct both of these issues:

- First, do not override size_mult when the options provide a zero value.
- Second, stop creating a split_opts in the fetch builtin.
Note that git log was broken between Git 2.22 (May 2019) and Git 2.27 (Q2 2020) when using a magic pathspec:
The command-line parsing of "git log :/a/b/" had been broken for about a full year without anyone noticing; it has now been corrected.
See commit 0220461 (10 Apr 2020) by Jeff King (peff).
See commit 5ff4b92 (10 Apr 2020) by Junio C Hamano (gitster).
(Merged by Junio C Hamano -- gitster -- in commit 95ca489, 22 Apr 2020)
sha1-name: do not assume that the ref store is initialized
Reported-by: Érico Rolim
c931ba4e ("sha1-name.c: remove the_repo from handle_one_ref()", 2019-04-16, Git v2.22.0-rc0 -- merge listed in batch #8) replaced the use of the for_each_ref() helper, which works with the main ref store of the default repository instance, with refs_for_each_ref(), which can work on any ref store instance, by assuming that the repository instance the function is given has its ref store already initialized.
But it is possible that nobody has initialized it, in which case the code ends up dereferencing a NULL pointer.
And:
repository: mark the "refs" pointer as private
Signed-off-by: Jeff King
The "refs" pointer in a struct repository starts life as NULL, but then is lazily initialized when it is accessed via get_main_ref_store().
However, it's easy for calling code to forget this and access it directly, leading to code which works some of the time, but fails if it is called before anybody else accesses the refs.
This was the cause of the bug fixed by 5ff4b920eb ("sha1-name: do not assume that the ref store is initialized", 2020-04-09, Git v2.27.0 -- merge listed in batch #3). In order to prevent similar bugs, let's more clearly mark the "refs" field as private.
There is another avenue for improving git log performance, and it builds on the commit-graph just mentioned:
Git 2.27 (Q2 2020) introduces a commit-graph extension to efficiently check the paths modified by each commit, using Bloom filters.
See commit caf388c (09 Apr 2020), and commit e369698 (30 Mar 2020) by Derrick Stolee (derrickstolee).
See commit d5b873c, commit a759bfa, commit 42e50e7, commit a56b946, commit d38e07b, commit 1217c03, commit 76ffbca (06 Apr 2020), and commit 3d11275, commit f97b932, commit ed591fe, commit f1294ea, commit f52207a, commit 3be7efc (30 Mar 2020) by Garima Singh (singhgarima).
See commit d21ee7d (30 Mar 2020) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit 9b6606f, 01 May 2020)
revision.c: use Bloom filters to speed up path based revision walks
Helped-by: Derrick Stolee <dstolee@microsoft.com>
Helped-by: SZEDER Gábor
Helped-by: Jonathan Tan
Signed-off-by: Garima Singh
Revision walk will now use Bloom filters for commits to speed up revision walks for a particular path (for computing history for that path), if they are present in the commit-graph file.
We load the Bloom filters during the prepare_revision_walk step, currently only when dealing with a single pathspec.
Extending it to work with multiple pathspecs can be explored and built on top of this series in the future.
While comparing trees in rev_compare_trees(), if the Bloom filter says that the file is not different between the two trees, we don't need to compute the expensive diff.
This is where we get our performance gains.
The other response of the Bloom filter is 'maybe', in which case we fall back to the full diff calculation to determine if the path was changed in the commit.
We do not try to use Bloom filters when the '--walk-reflogs' option is specified.
The '--walk-reflogs' option does not walk the commit ancestry chain like the rest of the options.
Incorporating the performance gains when walking reflog entries would add more complexity, and can be explored in a later series.
Performance Gains: We tested the performance of git log -- <path> on the git repo, the linux repo, and some large internal repos, with a variety of paths of varying depths.
On the git and linux repos:
- we observed a 2x to 5x speed up.
On a large internal repo with files seated 6-10 levels deep in the tree:
- we observed 10x to 20x speed ups, with some paths going up to 28 times faster.
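To try this on a repository of your own, a runnable sketch in a throwaway repo (flags per git-commit-graph(1), which requires Git 2.27+ for --changed-paths; the directory layout is made up for illustration):

```shell
# Sketch: write changed-path Bloom filters into the commit-graph,
# then path-scoped log walks can skip most tree diffs (Git 2.27+).
cd "$(mktemp -d)"
git init -q
git config user.name bob
git config user.email bob@example.com
mkdir -p deep/nested/dir
echo hello > deep/nested/dir/file.c
git add . && git commit -qm "add deep file"

git commit-graph write --reachable --changed-paths
git log --oneline -- deep/nested/dir/file.c
```

On a repo this small the effect is invisible; the quoted numbers above are for repositories the size of linux.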
But: Git 2.27 (Q2 2020) also fixes a leak found by a fuzzer.
See commit fbda77c (04 May 2020) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit 95875e0, 08 May 2020)
commit-graph: avoid memory leaks
Signed-off-by: Jonathan Tan
Reviewed-by: Derrick Stolee
A fuzzer running on the entry point provided by fuzz-commit-graph.c revealed a memory leak when parse_commit_graph() creates a struct bloom_filter_settings and then returns early due to error.
Fix that error by always freeing that struct first (if it exists) before returning early due to error.
While making that change, I also noticed another possible memory leak: when the BLOOMDATA chunk is provided but not BLOOMINDEXES.
Also fix that error.
Git 2.27 (Q2 2020) improves Bloom filters again:
See commit b928e48 (11 May 2020) by SZEDER Gábor (szeder).
See commit 2f6775f, commit 65c1a28, commit 8809328, commit 891c17c (11 May 2020), and commit 54c337b, commit eb591e4 (01 May 2020) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit 4b1e5e5, 14 May 2020)
bloom: de-duplicate directory entries
Signed-off-by: Derrick Stolee
When computing a changed-path Bloom filter, we need to take the files that changed from the diff computation and extract the parent directories. That way, a directory pathspec such as "Documentation" could match commits that change "Documentation/git.txt".
However, the current code does a poor job of this process.
The paths are added to a hashmap, but we do not check if an entry already exists with that path.
This can create many duplicate entries and cause the filter to have a much larger length than it should.
This means that the filter is more sparse than intended, which helps the false positive rate, but wastes a lot of space.
Properly use hashmap_get() before hashmap_add().
Also be sure to include a comparison function so these can be matched correctly.
This has an effect on a test in t0095-bloom.sh.
This makes sense: there are ten changes inside "smallDir", so the total number of paths in the filter should be 11.
This would result in 11 * 10 bits required, and with 8 bits per byte, this results in 14 bytes.
With Git 2.28 (Q3 2020), "git log -L..." now takes advantage of the "which paths does this commit touch?" information stored in the commit-graph system.
For that, the Bloom filters are used.
See commit f32dde8 (11 May 2020) by Derrick Stolee (derrickstolee).
See commit 002933f, commit 3cb9d2b, commit 48da94b, commit d554672 (11 May 2020) by SZEDER Gábor (szeder).
(Merged by Junio C Hamano -- gitster -- in commit c3a0282, 09 Jun 2020)
line-log: integrate with changed-path Bloom filters
Signed-off-by: Derrick Stolee
The previous changes to the line-log machinery focused on making the first result appear faster. This was achieved by no longer walking the entire commit history before returning the early results.
There is still another way to improve the performance: walk most commits much faster. Let's use the changed-path Bloom filters to reduce time spent computing diffs.
Since the line-log computation requires opening blobs and checking the content-diff, there is still a lot of necessary computation that cannot be replaced with changed-path Bloom filters.
The part that we can reduce is most effective when checking the history of a file that is deep in several directories and those directories are modified frequently.
In this case, the computation to check if a commit is TREESAME to its first parent takes a large fraction of the time.
That is ripe for improvement with changed-path Bloom filters.
We must ensure that prepare_to_use_bloom_filters() is called in revision.c so that the bloom_filter_settings are loaded into the struct rev_info from the commit-graph.
Of course, some cases are still forbidden, but in the line-log case the pathspec is provided in a different way than normal.
Since multiple paths and segments could be requested, we compute the struct bloom_key data dynamically during the commit walk. This could likely be improved, but adds code complexity that is not valuable at this time.
There are two cases to care about: merge commits and "ordinary" commits.
- Merge commits have multiple parents, but if we are TREESAME to our first parent in every range, then pass the blame for all ranges to the first parent.
- Ordinary commits have the same condition, but each is done slightly differently in the process_ranges_[merge|ordinary]_commit() methods.

By checking if the changed-path Bloom filter can guarantee TREESAME, we can avoid that tree-diff cost. If the filter says "probably changed", then we need to run the tree-diff, and then the blob-diff if there was a real edit.
The Linux kernel repository is a good testing ground for the performance improvements claimed here.
There are two different cases to test:
- The first is the "entire history" case, where we output the entire history to /dev/null to see how long it would take to compute the full line-log history.
- The second is the "first result" case, where we find how long it takes to show the first value, which is an indicator of how quickly a user would see responses when waiting at a terminal.
To test, I selected the paths that were changed most frequently in the top 10,000 commits using this command (stolen from Stack Overflow):

git log --pretty=format: --name-only -n 10000 | sort | \
    uniq -c | sort -rg | head -10
which results in:

  121 MAINTAINERS
   63 fs/namei.c
   60 arch/x86/kvm/cpuid.c
   59 fs/io_uring.c
   58 arch/x86/kvm/vmx/vmx.c
   51 arch/x86/kvm/x86.c
   45 arch/x86/kvm/svm.c
   42 fs/btrfs/disk-io.c
   42 Documentation/scsi/index.rst

(along with a bogus first result).
It appears that the path arch/x86/kvm/svm.c was renamed, so we ignore that entry. This leaves the following results for the real command time:

|                              | Entire History  | First Result    |
| Path                         | Before | After  | Before | After  |
|------------------------------|--------|--------|--------|--------|
| MAINTAINERS                  | 4.26 s | 3.87 s | 0.41 s | 0.39 s |
| fs/namei.c                   | 1.99 s | 0.99 s | 0.42 s | 0.21 s |
| arch/x86/kvm/cpuid.c         | 5.28 s | 1.12 s | 0.16 s | 0.09 s |
| fs/io_uring.c                | 4.34 s | 0.99 s | 0.94 s | 0.27 s |
| arch/x86/kvm/vmx/vmx.c       | 5.01 s | 1.34 s | 0.21 s | 0.12 s |
| arch/x86/kvm/x86.c           | 2.24 s | 1.18 s | 0.21 s | 0.14 s |
| fs/btrfs/disk-io.c           | 1.82 s | 1.01 s | 0.06 s | 0.05 s |
| Documentation/scsi/index.rst | 3.30 s | 0.89 s | 1.46 s | 0.03 s |
It is worth noting that the least speedup comes for the MAINTAINERS file which is:
- edited frequently,
- low in the directory hierarchy, and
- quite a large file.
All of those points lead to spending more time doing the blob diff and less time doing the tree diff.
Still, we see some improvement in that case and significant improvement in other cases.
A 2-4x speedup is likely the more typical case as opposed to the small 5% change for that file.
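A runnable sketch of the line-log usage measured above, in a throwaway repo (Git 2.28+ wires -L into the changed-path filters; the file name and line range here are illustrative):

```shell
# Sketch: line-range history (-L) benefits from changed-path Bloom
# filters once they are written into the commit-graph (Git 2.28+).
cd "$(mktemp -d)"
git init -q
git config user.name carol
git config user.email carol@example.com
printf '%s\n' one two three four five > notes.txt
git add notes.txt && git commit -qm "initial notes"

git commit-graph write --reachable --changed-paths
git log -L 1,3:notes.txt
```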
In Git 2.29 (Q4 2020), the changed-path Bloom filters were improved using ideas from an independent implementation.
See commit 7fbfe07, commit bb4d60e, commit 5cfa438, commit 2ad4f1a, commit fa79653, commit 0ee3cb8, commit 1df15f8, commit 6141cdf, commit cb9daf1, commit 35a9f1e (05 Jun 2020) by SZEDER Gábor (szeder).
(Merged by Junio C Hamano -- gitster -- in commit de6dda0, 30 Jul 2020)
commit-graph: simplify parse_commit_graph() #1
Signed-off-by: SZEDER Gábor
Signed-off-by: Derrick Stolee
While we iterate over all entries of the Chunk Lookup table we make sure that we don't attempt to read past the end of the mmap-ed commit-graph file, and check in each iteration that the chunk ID and offset we are about to read is still within the mmap-ed memory region. However, these checks in each iteration are not really necessary, because the number of chunks in the commit-graph file is already known before this loop from the just parsed commit-graph header.
So let's check that the commit-graph file is large enough for all entries in the Chunk Lookup table before we start iterating over those entries, and drop those per-iteration checks.
While at it, take into account the size of everything that is necessary to have a valid commit-graph file, i.e. the size of the header, the size of the mandatory OID Fanout chunk, and the size of the signature in the trailer as well.
Note that this necessitates the change of the error message as well.
The Chunk Lookup table stores the chunks' starting offset in the commit-graph file, not their sizes.
Consequently, the size of a chunk can only be calculated by subtracting its offset from the offset of the subsequent chunk (or that of the terminating label).
This is currently implemented in a somewhat complicated way: as we iterate over the entries of the Chunk Lookup table, we check the id of each chunk and store its starting offset; then we check the id of the last seen chunk and calculate its size using its previously saved offset.
At the moment there is only one chunk for which we calculate its size, but this patch series will add more, and the repeated chunk id checks are not that pretty.
Instead, let's read ahead the offset of the next chunk on each iteration, so we can calculate the size of each chunk right away, right where we store its starting offset.