git log --all doesn't work inside a filter-branch

Question

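I'm writing a git filter-branch --tree-filter command that uses git log --follow to check whether certain files should be kept or removed during filtering.

Basically, I want to keep the commits that contain a given file name, even if the file was renamed and/or moved.

This is the filter I'm running: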

git filter-branch --prune-empty --tree-filter '~/preserve.sh' -- --all

And this is the command I'm using inside preserve.sh:

git log --pretty=format:'%H' --name-only --follow --all -- "$f"

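The result is that when I search for a file by its new path, the commit that created the file (which was later moved to another path) gets dropped from the history, which shouldn't happen. For example: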

commit 1: creates foo/hello.txt;

commit 2: moves foo/hello.txt to bar/hello.txt;

using git filter-branch passing bar/hello.txt yields a history with only commit 2.
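
The scenario above can be reproduced in a throwaway repository. The sketch below is illustrative only: the temporary directory and the identity are placeholders, and it runs the log command outside filter-branch, where both paths are listed.

```shell
# Throwaway repository; the identity below is a placeholder.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.name "Example"
git config user.email "example@example.com"

# commit 1: creates foo/hello.txt
mkdir foo bar
echo "hello" > foo/hello.txt
git add foo/hello.txt
git commit -q -m "commit 1: create foo/hello.txt"

# commit 2: moves foo/hello.txt to bar/hello.txt
git mv foo/hello.txt bar/hello.txt
git commit -q -m "commit 2: move foo/hello.txt to bar/hello.txt"

# The log command from preserve.sh, run outside filter-branch.
f=bar/hello.txt
log_output=$(git log --pretty=format:'%H' --name-only --follow --all -- "$f")
printf '%s\n' "$log_output"
```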

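At first I thought this was because I wasn't using --all in git log: while analyzing commit 1, it wouldn't find foo/hello.txt because it was only looking at past history, where bar/hello.txt isn't mentioned anywhere. But then I added --all, which looks at all commits (including "future" ones), and nothing changed.

I checked out the commit that creates the file and ran that log command, and it worked (it listed both foo/hello.txt and bar/hello.txt), so there's nothing wrong with the command itself. I also logged the output of the log command while it was running through filter-branch, and in that case I can see that at commit 1 the file is not found (only bar/hello.txt is listed).

I think this happens because git is internally copying each commit into a "new repo" structure, so at the point where it analyzes commit 1 the newer commits don't exist yet.

Is there a way to fix this, or another way to approach the problem of re-writing history while preserving renames/moves?

I'm running a modified version of the script in this answer.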

Answer 1

or another way to approach the problem of re-writing history while preserving renames/moves?

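Since git filter-branch is soon to be deprecated, consider using the new newren/git-filter-repo.

But even that new tool (based on git fast-export/git fast-import) won't follow renamed files.

See newren/git-filter-repo issue 25, which indirectly illustrates the challenges of filtering a repository (with the old git filter-branch or the new filter-repo command) while taking renamed files into account.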

[...] This is consistent with how the rev-list, log, and fast-export git subcommands work. E.g. git log -- src/ledger/bin/app/app.cc won't show any history for other paths that this file was renamed or copied from (or for which parts of it came from).
You used the --follow flag specifically, which is a big hack as even noted in the git log documentation (it mentions that it only works when a single file is specified).
If rev-list/log/fast-export, etc. had a --follow option that followed renames, I could simply expose it from filter-repo, but despite the desire for such an option no one has implemented it in many years.
There's some good challenges there too, e.g. we'd probably want to traverse in topological order and we may need two passes -- one to create the topological ordering, and the second to build up additional paths from renames. (A case where this might be necessary: some branch builds on top of 'master' and has some paths within the specified pathspec that came from a rename of something outside the pathspec at the time 'master' existed. If 'master' was traversed before the other branch, then we'd have already picked the more limited pathspec and miss the extra needed paths.)

But even if --follow implemented following of renames for multiple files or a directory or more, that still wouldn't necessarily be sufficient because perhaps the user needs copy detection (i.e. it wasn't a file renamed from somewhere else, rather it was copied).
But with copy detection it's not as clear if you want the full history of the original; I can imagine that in some cases you would but not others.

And if we start doing either rename or copy detection, then we're moving from well-defined correct behavior to heuristics.
For diffs or logs or even merges that's fine, because the results will be interpreted by a user (even in merge, if the detection is wrong, the user can fixup conflicts and make other edits).
Here, we'd record the results of the heuristics in stone. That's a bit worrying to me...and it also means we'd have to open up a pile of knobs (at the very least a similarity percentage, and whether copies are wanted in addition to renames) for configuration.

All that said, I wanted something like that when I was using it too.
The best compromise I came up with was to have people run 'git filter-repo --analyze' beforehand, look at the renames sub-report, and pick out additional paths by hand based on that to feed to their filter-repo run.
The --analyze option still had a few caveats with the rename detection, but that was mostly fundamental to the problem. Providing it and letting the user decide what to include (though I didn't even bother with copy detection), seemed like the best option I had available.
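
The compromise described at the end of that quote could look roughly like this. It assumes git-filter-repo is installed and that you're working in a fresh clone; the helper names and the temporary paths file are made up for illustration, while --analyze and --paths-from-file are filter-repo options.

```shell
# Step 1 (hypothetical helper name): write the reports, then show the renames one.
# By default filter-repo puts them under <git-dir>/filter-repo/analysis/.
show_renames_report() {
    git filter-repo --analyze &&
        cat "$(git rev-parse --git-dir)/filter-repo/analysis/renames.txt"
}

# Step 2: paths picked by hand from that report, one per line.
# The two paths are the ones from the question; the file itself is a placeholder.
paths_file=$(mktemp)
printf '%s\n' 'bar/hello.txt' 'foo/hello.txt' > "$paths_file"

# Step 3 (hypothetical helper name): rewrite history keeping only those paths.
filter_listed_paths() {
    git filter-repo --paths-from-file "$1"
}

# Usage, in a fresh clone:
#   show_renames_report
#   filter_listed_paths "$paths_file"
```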

Answer 2

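Basically, what you want to do here is:

1. Build a map of all commits in the repository, indexed by hash ID.
2. For each commit, determine the path names you want to keep/use when running your filter.
3. Run git filter-branch (or, at this point, just run your own code, since the map you built in step 1 and the things you computed in step 2 are a significant part of what filter-branch does) to copy old commits to new commits.
4. If you use your own code, create or update branch names for the last-copied commits.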
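
One coarse way to approximate step 2 is to compute the names ahead of time, in the original repository, before any rewriting starts. The sketch below is a shortcut rather than the per-commit map described above: historical_names is a hypothetical helper, it yields a single set of names for the whole history, and it relies on git log --follow, which handles one path at a time and uses rename heuristics. It also assumes none of those names was ever used by an unrelated file.

```shell
# historical_names PATH
# Hypothetical helper: prints every name PATH has had, one per line,
# following renames backwards from the tips of all refs.
historical_names() {
    # --format= suppresses the commit header; sed drops any blank separator lines.
    git log --all --follow --name-only --format= -- "$1" |
        sed '/^$/d' |
        sort -u
}

# Example (run in the original repository, before filter-branch):
#   historical_names bar/hello.txt > /tmp/list1
```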

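You can use git read-tree to copy each commit into an index (the main index or a temporary one), then use Git tools to modify the index so that it contains the names and hash IDs you want to keep. Then use git write-tree and git commit-tree to build your new commits, just as filter-branch does.

A simpler case

If you don't have too many alternative names for your files, you may be able to simplify this quite a bit. Suppose, for instance, that the history (the chain of commits) in the repository looks like this, with two big history bottlenecks B1 and B2: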
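
As a rough illustration of that read-tree / write-tree / commit-tree sequence, here is a minimal sketch for copying a single commit. copy_commit and its arguments are hypothetical, the keep list is a plain file with one path per line, and paths with unusual characters (which git ls-files would quote) are not handled.

```shell
# copy_commit OLD_COMMIT KEEP_LIST [NEW_PARENT]
# Hypothetical helper: prints the hash of a new commit whose tree contains
# only the paths of OLD_COMMIT that are named in KEEP_LIST.
# Author, committer and dates are NOT carried over in this sketch.
copy_commit() {
    old=$1
    keep_list=$2
    new_parent=${3:-}
    tmp_dir=$(mktemp -d) || return 1
    (
        # Use a temporary index so the main index is left alone.
        GIT_INDEX_FILE="$tmp_dir/index"
        export GIT_INDEX_FILE

        # Load the old commit's tree into the temporary index.
        git read-tree "$old" || exit 1

        # Remove every index entry whose path is not in the keep list.
        git ls-files | grep -vxFf "$keep_list" |
            git update-index --force-remove --stdin || exit 1

        tree=$(git write-tree) || exit 1

        # Reuse the old message; link to the already-copied parent, if any.
        if [ -n "$new_parent" ]; then
            git log -1 --format=%B "$old" | git commit-tree "$tree" -p "$new_parent"
        else
            git log -1 --format=%B "$old" | git commit-tree "$tree"
        fi
    )
    status=$?
    rm -rf "$tmp_dir"
    return "$status"
}
```

A real rewrite would call this for each commit in topological order, passing the new hash of the already-copied parent, and would also need to handle merges (several parents), which this sketch leaves out.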

  _______________________          ________________          _________
 /                       \        /                \        /         \--bra
< large cloud of commits  >--B1--< cloud of commits >--B2--<    ...    >--nch
 \_______________________/        \________________/        \_________/--es

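The file names you want to keep are all the same within any one of the three big bubbles, but at commit B2 there's a mass rename, so the names are different in the middle bubble, and likewise there's a mass rename at B1, so the names are different in the first bubble.

In this case you can do an explicit history test in any filter (tree filter, index filter, whatever you like, though index filters are much faster than tree filters) to determine which file names to keep. Remember that filter-branch copies commits one at a time, in topological order, so that the new copied parents are created before any new copied children have to be created. That is, it works on the commits from the first group first, then copies the bottleneck commit B1, then works on the commits from the second group, and so on.

The hash ID of the commit being copied is available to your filter (whichever filter or filters you use): it's $GIT_COMMIT. So you just need to test:

Is $GIT_COMMIT an ancestor of B1? If so, you're in the first group.
Is $GIT_COMMIT an ancestor of B2? If so, you're in the first or second group.

So an index filter that consists of "preserve names from set of names" can be written as: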

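# Pick the name list that matches the part of history $GIT_COMMIT is in.
# Note: a commit counts as its own ancestor, so B1 itself takes the first branch.
if git merge-base --is-ancestor "$GIT_COMMIT" <hash of B1>; then
    set_of_names=/tmp/list1
elif git merge-base --is-ancestor "$GIT_COMMIT" <hash of B2>; then
    set_of_names=/tmp/list2
else
    set_of_names=/tmp/list3
fi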
...

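where the files /tmp/list1, /tmp/list2, and /tmp/list3 contain the names of the files to keep. You now just have to write the ... code that implements "keep fixed set of file names during index filter operation". In any case, this has actually been done already, in this answer to extract multiple directories using git-filter-branch (as you found earlier today).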
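
For what it's worth, one possible shape for the whole index filter, including the part elided as "..." above, is sketched here. This is not the code from the linked answer: preserve_names, B1_HASH, B2_HASH and LIST_DIR are names invented for the example, the list files hold one path per line, and paths that git ls-files would quote are not handled.

```shell
# preserve_names: meant to run as an index filter, where filter-branch has
# already set GIT_COMMIT and GIT_INDEX_FILE.
# Assumed inputs (hypothetical): B1_HASH and B2_HASH hold the hashes of the
# two bottleneck commits; LIST_DIR (default /tmp) holds list1, list2, list3.
preserve_names() {
    list_dir=${LIST_DIR:-/tmp}

    # merge-base --is-ancestor also succeeds when both commits are the same,
    # so B1 itself falls into the first branch and B2 into the second.
    if git merge-base --is-ancestor "$GIT_COMMIT" "$B1_HASH"; then
        set_of_names=$list_dir/list1
    elif git merge-base --is-ancestor "$GIT_COMMIT" "$B2_HASH"; then
        set_of_names=$list_dir/list2
    else
        set_of_names=$list_dir/list3
    fi

    # Remove from the index every path that is not named in the chosen list.
    git ls-files | grep -vxFf "$set_of_names" |
        git update-index --force-remove --stdin
}
```

It could be wired in with something like git filter-branch --prune-empty --index-filter '. /path/to/preserve-names.sh; preserve_names' -- --all, where the script path is a placeholder and B1_HASH and B2_HASH are exported beforehand.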
