What is the storage mechanism behind Git Large File Storage?
GitHub recently introduced an extension to Git that stores large files in a different way. What exactly does "the extension replaces large files with text pointers inside Git" mean?
You can see in the git-lfs sources how a "text pointer" is defined:
type Pointer struct {
    Version string
    Oid     string
    Size    int64
    OidType string
}
The smudge and clean sources show that git-lfs can use a content filter driver in order to:
- download the actual files on checkout
- store them in an external source on commit.
The core Git LFS idea is that instead of writing large blobs to a Git repository, only a pointer file is written.
version https://git-lfs.github.com/spec/v1
oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
size 12345
(ending \n)
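To make the pointer format concrete, here is a minimal Go sketch that renders and parses a pointer file like the one above. It reuses the Pointer struct quoted earlier; the Encode and DecodePointer helpers are hypothetical names chosen for illustration, not functions taken from the git-lfs sources, and the parser only understands the three keys shown in the example.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Pointer mirrors the struct quoted from the git-lfs sources.
type Pointer struct {
	Version string
	Oid     string
	Size    int64
	OidType string
}

// Encode renders the pointer as "key value" lines, each ending in \n.
// (Hypothetical helper, not the git-lfs implementation.)
func (p Pointer) Encode() string {
	return fmt.Sprintf("version %s\noid %s:%s\nsize %d\n",
		p.Version, p.OidType, p.Oid, p.Size)
}

// DecodePointer parses the text form back into a Pointer.
// (Hypothetical helper; it rejects any key other than version, oid, size.)
func DecodePointer(text string) (Pointer, error) {
	var p Pointer
	for _, line := range strings.Split(strings.TrimSuffix(text, "\n"), "\n") {
		parts := strings.SplitN(line, " ", 2)
		if len(parts) != 2 {
			return p, fmt.Errorf("malformed pointer line %q", line)
		}
		key, value := parts[0], parts[1]
		switch key {
		case "version":
			p.Version = value
		case "oid":
			oid := strings.SplitN(value, ":", 2)
			if len(oid) != 2 {
				return p, fmt.Errorf("malformed oid %q", value)
			}
			p.OidType, p.Oid = oid[0], oid[1]
		case "size":
			size, err := strconv.ParseInt(value, 10, 64)
			if err != nil {
				return p, fmt.Errorf("malformed size %q: %w", value, err)
			}
			p.Size = size
		default:
			return p, fmt.Errorf("unknown key %q", key)
		}
	}
	if p.Version == "" || p.Oid == "" {
		return p, fmt.Errorf("pointer is missing version or oid")
	}
	return p, nil
}

func main() {
	p := Pointer{
		Version: "https://git-lfs.github.com/spec/v1",
		Oid:     "4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393",
		Size:    12345,
		OidType: "sha256",
	}
	text := p.Encode()
	fmt.Print(text)

	// Round trip: decoding the text should give back the same struct.
	back, err := DecodePointer(text)
	fmt.Println(back == p, err)
}
```

The point of the round trip is that the pointer carries everything needed to find the real content later: the hash identifies the object, and the size lets a client check what it fetched.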
Git LFS needs a URL endpoint to talk to a remote server.
A Git repository can have different Git LFS endpoints for different remotes.
The actual files are uploaded to or downloaded from a server that implements the Git LFS API.
The git-lfs man page confirms this, stating:
The actual file gets pushed to a Git LFS API
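You need a Git server that implements the API in order to support uploading and downloading binary content.
As for the content filter driver (which has existed in Git for a long time, predating LFS, and which LFS uses here to add this "large file management" feature), this is where most of the work happens: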
The smudge filter runs as files are being checked out from the Git repository to the working directory.
Git sends the content of the Git blob as STDIN, and expects the content to write to the working directory as STDOUT.
- Read 100 bytes.
- If the content is ASCII and matches the pointer file format:
  - Look for the file in .git/lfs/objects/{OID}.
  - If it's not there, download it from the server.
  - Read its contents to STDOUT.
- Otherwise, simply pass the STDIN out through STDOUT.
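The smudge steps above can be sketched in Go as follows. This is only an illustration under a few assumptions: the function names (looksLikePointer, pointerOid, smudge) are hypothetical, the object path follows the .git/lfs/objects/{OID} layout quoted above, and the "download it from the server" step is replaced by an error, since the transfer API is outside the scope of this sketch.

```go
package main

import (
	"bufio"
	"bytes"
	"errors"
	"fmt"
	"io"
	"os"
	"path/filepath"
	"strings"
)

const pointerPrefix = "version https://git-lfs.github.com/spec/v1\n"

// errNotCached stands in for the "download it from the server" step.
var errNotCached = errors.New("object not in local store; a real filter would download it")

// looksLikePointer checks the first bytes of the blob: ASCII only, and
// starting with the pointer file's version line.
func looksLikePointer(head []byte) bool {
	for _, b := range head {
		if b > 127 {
			return false
		}
	}
	return bytes.HasPrefix(head, []byte(pointerPrefix))
}

// pointerOid extracts the hash from the "oid sha256:<hash>" line.
func pointerOid(text string) (string, bool) {
	for _, line := range strings.Split(text, "\n") {
		if strings.HasPrefix(line, "oid sha256:") {
			return strings.TrimPrefix(line, "oid sha256:"), true
		}
	}
	return "", false
}

// smudge copies in to out, replacing a pointer with the object it names.
func smudge(in io.Reader, out io.Writer, objectsDir string) error {
	r := bufio.NewReader(in)
	head, _ := r.Peek(100) // a short blob simply yields fewer bytes
	if !looksLikePointer(head) {
		_, err := io.Copy(out, r) // not a pointer: pass through untouched
		return err
	}
	text, err := io.ReadAll(r) // pointer files are tiny, so read it whole
	if err != nil {
		return err
	}
	oid, ok := pointerOid(string(text))
	if !ok {
		_, err := out.Write(text)
		return err
	}
	f, err := os.Open(filepath.Join(objectsDir, oid))
	if errors.Is(err, os.ErrNotExist) {
		return errNotCached
	}
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = io.Copy(out, f)
	return err
}

func main() {
	// Demo: content that is not a pointer is passed through unchanged.
	var out bytes.Buffer
	err := smudge(strings.NewReader("plain text\n"), &out, filepath.Join(".git", "lfs", "objects"))
	fmt.Printf("%q %v\n", out.String(), err)
}
```

To use it as an actual filter you would call smudge(os.Stdin, os.Stdout, ...) instead of the demo in main, since Git hands the blob to the filter on STDIN and reads the result from STDOUT.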
The clean filter runs as files are added to repositories.
Git sends the content of the file being added as STDIN, and expects the content to write to Git as STDOUT.
- Stream binary content from STDIN to a temp file, while calculating its SHA-256 signature.
- Check for the file at .git/lfs/objects/{OID}.
- If it does not exist:
  - Queue the OID to be uploaded.
  - Move the temp file to .git/lfs/objects/{OID}.
- Delete the temp file.
- Write the pointer file to STDOUT.
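Here is a similar hedged sketch of the clean steps in Go. The names clean and pointerText are hypothetical, the upload queue is reduced to a comment, and the objects directory is a parameter so the demo can run in a throwaway directory instead of a real .git/lfs/objects.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
	"path/filepath"
	"strings"
)

// pointerText renders the pointer file that Git stores instead of the blob.
func pointerText(oid string, size int64) string {
	return fmt.Sprintf("version https://git-lfs.github.com/spec/v1\noid sha256:%s\nsize %d\n", oid, size)
}

// clean spools in to a temp file while hashing it, files the object under
// objectsDir/{OID} if it is new, and writes the pointer to out.
func clean(in io.Reader, out io.Writer, objectsDir string) error {
	if err := os.MkdirAll(objectsDir, 0o755); err != nil {
		return err
	}
	// Create the temp file next to its destination so the rename below
	// stays on one filesystem.
	tmp, err := os.CreateTemp(objectsDir, "incoming-*")
	if err != nil {
		return err
	}
	// Delete the temp file on the way out; after a successful rename it
	// no longer exists and the removal simply fails quietly.
	defer os.Remove(tmp.Name())

	hash := sha256.New()
	size, err := io.Copy(io.MultiWriter(tmp, hash), in)
	if closeErr := tmp.Close(); err == nil {
		err = closeErr
	}
	if err != nil {
		return err
	}
	oid := hex.EncodeToString(hash.Sum(nil))

	dest := filepath.Join(objectsDir, oid)
	if _, statErr := os.Stat(dest); os.IsNotExist(statErr) {
		// A real filter would also queue oid for upload at this point.
		if err := os.Rename(tmp.Name(), dest); err != nil {
			return err
		}
	} else if statErr != nil {
		return statErr
	}

	_, err = io.WriteString(out, pointerText(oid, size))
	return err
}

func main() {
	// Demo in a throwaway directory rather than a real .git/lfs/objects.
	dir, err := os.MkdirTemp("", "lfs-objects-")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer os.RemoveAll(dir)

	if err := clean(strings.NewReader("large binary content"), os.Stdout, dir); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```

Hashing while streaming means the content is read only once, and because the object is stored under its own hash, adding the same file twice leaves a single copy in the local store.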
Git 2.11 (November 2016) has a commit that details how this works: commit edcc858, helped by Martin-Louis Bright and signed off by Lars Schneider.
convert: add filter.<driver>.process option
Git's clean/smudge mechanism invokes an external filter process for
every single blob that is affected by a filter. If Git filters a lot of
blobs then the startup time of the external filter processes can become
a significant part of the overall Git execution time.
In a preliminary performance test this developer used a clean/smudge
filter written in golang to filter 12,000 files. This process took 364s
with the existing filter mechanism and 5s with the new mechanism. See
details here: git-lfs/git-lfs#1382
This patch adds the filter.<driver>.process string option which, if
used, keeps the external filter process running and processes all blobs
with the packet format (pkt-line) based protocol over standard input and
standard output.
The full protocol is explained in detail in Documentation/gitattributes.txt.
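For readers who have not met pkt-line before: each packet is a four-digit hexadecimal length (which counts those four digits too) followed by the payload, and the special packet 0000 is the flush packet that ends a list. The Go sketch below shows that framing as I understand it from Git's protocol documentation; encodePktLine and decodePktLines are hypothetical helper names, not part of Git or git-lfs.

```go
package main

import (
	"fmt"
	"strconv"
)

// flushPkt is the zero-length packet that terminates a list of packets.
const flushPkt = "0000"

// maxPayload is the largest data component a single pkt-line may carry.
const maxPayload = 65516

// encodePktLine prefixes payload with its length as four hex digits.
// The length includes the four prefix bytes themselves.
func encodePktLine(payload string) (string, error) {
	if len(payload) > maxPayload {
		return "", fmt.Errorf("payload of %d bytes exceeds %d", len(payload), maxPayload)
	}
	return fmt.Sprintf("%04x%s", len(payload)+4, payload), nil
}

// decodePktLines reads packets from data up to the first flush packet and
// returns their payloads plus whatever follows the flush.
func decodePktLines(data string) ([]string, string, error) {
	var payloads []string
	for {
		if len(data) < 4 {
			return nil, "", fmt.Errorf("truncated length header")
		}
		length, err := strconv.ParseUint(data[:4], 16, 16)
		if err != nil {
			return nil, "", fmt.Errorf("bad length header %q", data[:4])
		}
		n := int(length)
		if n == 0 {
			return payloads, data[4:], nil // flush packet: end of list
		}
		if n < 4 || n > len(data) {
			return nil, "", fmt.Errorf("bad packet length %d", n)
		}
		payloads = append(payloads, data[4:n])
		data = data[n:]
	}
}

func main() {
	first, _ := encodePktLine("git-filter-client\n")
	second, _ := encodePktLine("version=2\n")
	stream := first + second + flushPkt
	fmt.Printf("%q\n", stream)

	payloads, rest, err := decodePktLines(stream)
	fmt.Printf("%q %q %v\n", payloads, rest, err)
}
```

Because every packet announces its own length, one long-running process can handle many blobs over the same pair of pipes without Git having to close STDIN to signal the end of each one.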
A few key decisions:
- The long running filter process is referred to as filter protocol
version 2 because the existing single shot filter invocation is
considered version 1.
- Git sends a welcome message and expects a response right after the
external filter process has started. This ensures that Git will not
hang if a version 1 filter is incorrectly used with the
filter.<driver>.process option for version 2 filters. In addition,
Git can detect this kind of error and warn the user.
- The status of a filter operation (e.g. "success" or "error") is set
before the actual response and (if necessary!) re-set after the
response. The advantage of this two step status response is that if
the filter detects an error early, then the filter can communicate
this and Git does not even need to create structures to read the
response.
- All status responses are pkt-line lists terminated with a flush
packet. This allows us to send other status fields with the same
protocol in the future.
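To illustrate the welcome message and the two-step status from the filter's side, here is a small Go sketch of the bytes a version 2 filter might write. The packet contents (git-filter-server, version=2, status=success, status=error) follow my reading of Documentation/gitattributes.txt; the function names are hypothetical, and the capability exchange that follows the welcome message is left out.

```go
package main

import "fmt"

const flushPkt = "0000"  // ends every list of packets
const maxPayload = 65516 // largest data component of one pkt-line

// pkt frames one payload as a pkt-line (four hex digits of length, then data).
func pkt(payload string) string {
	return fmt.Sprintf("%04x%s", len(payload)+4, payload)
}

// welcomeReply is the filter's answer to Git's welcome message.
func welcomeReply() string {
	return pkt("git-filter-server\n") + pkt("version=2\n") + flushPkt
}

// earlyError reports a failure before any content is sent, so Git never
// has to set up anything to read a response.
func earlyError() string {
	return pkt("status=error\n") + flushPkt
}

// frameResponse sends a status first, then the content, then a second
// status list. An empty second list keeps the first status; a non-empty
// one re-sets it, which is how a late failure is reported.
func frameResponse(content string, failedLate bool) string {
	out := pkt("status=success\n") + flushPkt
	for len(content) > 0 {
		n := len(content)
		if n > maxPayload {
			n = maxPayload // large content is split across packets
		}
		out += pkt(content[:n])
		content = content[n:]
	}
	out += flushPkt // end of content
	if failedLate {
		out += pkt("status=error\n")
	}
	return out + flushPkt // end of the second status list
}

func main() {
	fmt.Printf("%q\n", welcomeReply())
	fmt.Printf("%q\n", frameResponse("hi", false))
	fmt.Printf("%q\n", frameResponse("hi", true))
	fmt.Printf("%q\n", earlyError())
}
```

Since both status lists end with a flush packet, extra key=value fields could be added to them later without changing the framing, which is the extensibility the last bullet refers to.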
As a result, a warning was added in Git 2.12 (Q1 2017).
See commit 7eeda8b (18 Dec 2016), and commit c6b0831 (03 Dec 2016) by Lars Schneider (larsxschneider).
(Merged by Junio C Hamano -- gitster -- in commit 08721a0, 27 Dec 2016)
docs: warn about possible '=' in clean/smudge filter process values
A pathname value in a clean/smudge filter process "key=value" pair can
contain the '=' character (introduced in edcc858).
Make the user aware of this issue in the docs, add a corresponding test case, and fix the issue in filter process value parser of the example implementation in contrib.
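In practice this means a filter should split each "key=value" line at the first '=' only. A minimal Go sketch of that idea follows; parseKeyValue is a hypothetical helper for illustration and is not the parser from the example implementation in contrib.

```go
package main

import (
	"fmt"
	"strings"
)

// parseKeyValue splits a "key=value" line at the first '=' only, so any
// further '=' characters remain part of the value.
func parseKeyValue(line string) (key, value string, ok bool) {
	line = strings.TrimSuffix(line, "\n")
	i := strings.IndexByte(line, '=')
	if i < 0 {
		return "", "", false
	}
	return line[:i], line[i+1:], true
}

func main() {
	// A pathname that itself contains '=' must survive intact.
	key, value, ok := parseKeyValue("pathname=docs/a=b.bin\n")
	fmt.Println(key, value, ok)
}
```

Splitting on every '=' instead would truncate or mangle such a pathname, which is the pitfall the documentation warning is about.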