Can't get git to play nice with iconv and UTF-16
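I'm trying to get git to recognize UTF-16 as text so that I can view diffs and patches as native text, but I can't get the textconv parameter to work.
I can manually call
iconv -f utf-16 -t utf-8 some-utf-16-file.rc
and everything works fine. But if I configure my .gitconfig as follows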
[diff "utf16"]
textconv = "iconv -f utf-16le -t utf-8"
and my .gitattributes:
# Custom for MFC
*.rc text eol=crlf diff=utf16
then when I run git diff, the following is displayed:
iconv: C:/Users/Mahmoud/AppData/Local/Temp/IjLBZ8_OemKey.rc:104:1: incomplete character or shift sequence
Using procmon, I was able to trace it to the creation of this process:
sh -c "iconv.exe -f utf-16le -t utf-8 \"$@\"" "iconv.exe -f utf-16le -t utf-8" C:/Users/Mahmoud/AppData/Local/Temp/JLOkVa_OemKey.rc
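...which I can actually run just fine (albeit on the actual file).
Any ideas?
(Note that I'm aware of the various solutions for getting git to work with UTF-16. I'm specifically trying to figure out why iconv works on its own but fails when git invokes it. Also, this error was originally encountered while trying one of the linked solutions from the "duplicate" question. Thanks, everyone.)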
Use only diff, and it should work:
*.rc diff=utf16
text and eol cause git to replace line endings before passing the data to iconv, after which it is no longer valid UTF-16.
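To make the line-ending point concrete, here is a minimal sketch of one plausible way the data gets damaged, assuming the end-of-line conversion works on raw bytes without knowing the encoding. It simulates the rewrite on a hex dump instead of calling git, and the variable names are only illustrative.

```shell
# UTF-16LE bytes for "a" followed by LF: 61 00 0a 00 (an even number of bytes).
original=$(printf 'a\000\n\000' | od -An -v -tx1 | tr -s ' \n' ' ')

# Simulate an encoding-unaware LF -> CRLF rewrite on the hex dump:
# a 0d byte is placed in front of every 0a byte.
converted=$(printf '%s' "$original" | sed 's/0a/0d 0a/g')

# Count the bytes in each dump (one word per byte).
original_count=$(printf '%s' "$original" | wc -w | tr -d ' ')
converted_count=$(printf '%s' "$converted" | wc -w | tr -d ' ')

echo "before: $original ($original_count bytes)"
echo "after:  $converted ($converted_count bytes)"
```

The converted stream has an odd number of bytes, so the last 16-bit code unit is cut in half. That would be consistent with iconv's "incomplete character or shift sequence" complaint, although the exact bytes git hands to iconv were not inspected here.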
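Git has recently started to understand encodings, i.e. iconv is now, to some extent, effectively built in. See the gitattributes documentation and search for working-tree-encoding
[Make sure your man page matches, since this is quite new!]
If (say) the file is UTF-16 without a BOM on a Windows machine, add this to your .gitattributes file:
some-utf-16-file.rc text working-tree-encoding=UTF-16LE eol=CRLF
If it is UTF-16 little-endian (with BOM) on *nix, make it:
some-utf-16-file.rc text working-tree-encoding=UTF-16 eol=LF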
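As a rough analogy for why text and eol are safe together with working-tree-encoding, the sketch below converts to UTF-8 before touching line endings, and restores them before encoding back. It assumes that this is the order Git applies, and it uses plain iconv, tr and awk in a scratch directory instead of Git itself; all file names are hypothetical.

```shell
# Scratch directory for the illustration (all file names are hypothetical).
tmp=$(mktemp -d)

# Working-tree side: "git" plus CRLF, encoded as UTF-16LE without a BOM.
printf 'g\000i\000t\000\r\000\n\000' > "$tmp/worktree.rc"

# Check-in direction: decode to UTF-8 first, then turn CRLF into LF.
iconv -f UTF-16LE -t UTF-8 "$tmp/worktree.rc" | tr -d '\r' > "$tmp/stored.txt"

# Check-out direction: restore CRLF on the UTF-8 text, then encode as UTF-16LE.
awk '{ printf "%s\r\n", $0 }' "$tmp/stored.txt" | iconv -f UTF-8 -t UTF-16LE > "$tmp/checkout.rc"

# Inspect the stored form and compare the round trip byte for byte.
stored_hex=$(od -An -v -tx1 "$tmp/stored.txt" | tr -d ' \n')
cmp -s "$tmp/worktree.rc" "$tmp/checkout.rc" && roundtrip=same || roundtrip=different
echo "stored as: $stored_hex, round trip: $roundtrip"
```

Because the line-ending step only ever sees UTF-8 here, it never inserts a stray byte into the middle of a 16-bit code unit.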
Git 2.21 (Feb. 2019) adds a new encoding, UTF-16LE-BOM: it is meant to force the encoding to UTF-16 with a BOM in little-endian byte order, which cannot be generated directly with iconv.
See commit aab2a1a (30 Jan 2019) by Torsten Bögershausen (tboegi).
(Merged by Junio C Hamano -- gitster -- in commit 0fa3cc7, 07 Feb 2019)
Support working-tree-encoding "UTF-16LE-BOM"
Users who want UTF-16 files in the working tree set the .gitattributes
like this:
test.txt working-tree-encoding=UTF-16
The unicode standard itself defines 3 allowed ways how to encode UTF-16.
The following 3 versions convert all back to 'g' 'i' 't' in UTF-8:
a) UTF-16, without BOM, big endian:
$ printf "\000g\000i\000t" | iconv -f UTF-16 -t UTF-8 | od -c
0000000   g   i   t
b) UTF-16, with BOM, little endian:
$ printf "\377\376g\000i\000t\000" | iconv -f UTF-16 -t UTF-8 | od -c
0000000   g   i   t
c) UTF-16, with BOM, big endian:
$ printf "\376\377\000g\000i\000t" | iconv -f UTF-16 -t UTF-8 | od -c
0000000   g   i   t
Git uses libiconv to convert from UTF-8 in the index into UTF-16 in the
working tree.
After a checkout, the resulting file has a BOM and is encoded in "UTF-16",
in the version (c) above.
This is what iconv generates, more details follow below.
iconv (and libiconv) can generate UTF-16, UTF-16LE or UTF-16BE:
d) UTF-16
$ printf 'git' | iconv -f UTF-8 -t UTF-16 | od -c
0000000 376 377  \0   g  \0   i  \0   t
e) UTF-16LE
$ printf 'git' | iconv -f UTF-8 -t UTF-16LE | od -c
0000000   g  \0   i  \0   t  \0
f) UTF-16BE
$ printf 'git' | iconv -f UTF-8 -t UTF-16BE | od -c
0000000  \0   g  \0   i  \0   t
There is no way to generate version (b) from above in a Git working tree,
but that is what some applications need.
(All fully unicode aware applications should be able to read all 3 variants,
but in practice, we are not there yet).
When producing UTF-16 as an output, iconv generates the big endian version
with a BOM. (big endian is probably chosen for historical reasons).
iconv can produce UTF-16 files with little endianness by using "UTF-16LE"
as encoding, and that file does not have a BOM.
Not all users (especially under Windows) are happy with this.
Some tools are not fully unicode aware and can only handle version (b).
Today there is no way to produce version (b) with iconv (or libiconv).
Looking into the history of iconv, it seems as if version (c) will be used in all future iconv versions (for compatibility reasons).
Solve this dilemma and introduce a Git-specific "UTF-16LE-BOM".
libiconv cannot handle the encoding, so Git picks it up, handles the BOM
and uses libiconv to convert the rest of the stream. (UTF-16BE-BOM is added for consistency)
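The split described in the commit message (the BOM handled separately, iconv converting the rest) can be imitated by hand. This is only a byte-level sketch of what variant (b) looks like, not Git's actual code; the variable name variant_b is made up for the example.

```shell
# Emit the little-endian BOM (ff fe) ourselves, then let iconv produce the
# BOM-less UTF-16LE payload, and dump the combined stream as hex.
variant_b=$( { printf '\377\376'; printf 'git' | iconv -f UTF-8 -t UTF-16LE; } | od -An -v -tx1 | tr -d ' \n')

echo "$variant_b"
```

The dump starts with fffe followed by the little-endian code units for 'g', 'i', 't', which is the layout of version (b) above.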