Tarfile/Zipfile extractall() changing filename of some files

Question

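Hello, I am currently working on a tool that has to extract some .tar files.

It works well in most cases, but I have one problem:

Some .tar and .zip archives contain files whose names include "illegal" characters (e.g. ":"). The program has to run on Windows machines, so I have to deal with this.

Is there a way to change the names of files in the extracted output if they contain ":" or another character that is illegal on Windows?

My current implementation: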

import tarfile
import zipfile


def read_zip(filepath, extractpath):
    with zipfile.ZipFile(filepath, 'r') as zfile:
        contains_bad_char = False
        for finfo in zfile.infolist():
            if ":" in finfo.filename:
                contains_bad_char = True
        if not contains_bad_char:
            zfile.extractall(path=extractpath)


def read_tar(filepath, extractpath):
    with tarfile.open(filepath, "r:gz") as tar:
        contains_bad_char = False
        for member in tar.getmembers():
            if ":" in member.name:
                contains_bad_char = True
        if not contains_bad_char:
            tar.extractall(path=extractpath)

So at the moment I simply skip these archives, which is not ideal.

To describe what I am asking for more clearly, here is a small example:

file_with_files.tar -> small_file_1.txt
                    -> small_file_2.txt
                    -> annoying:file_1.txt
                    -> annoying:file_1.txt

should be extracted to

file_with_files -> small_file_1.txt
                -> small_file_2.txt
                -> annoying_file_1.txt
                -> annoying_file_1.txt

Is iterating over every file object in the archive and extracting them one by one the only solution, or is there a more elegant one?

Answer 1

According to [Python.Docs]: ZipFile.extract(member, path=None, pwd=None):

On Windows illegal characters (:, <, >, |, ", ?, and *) replaced by underscore (_).

So, for .zip files this is already taken care of:

>>> import os
>>> import zipfile
>>>
>>> os.getcwd()
'e:\\Work\\Dev\\StackOverflow\\q055340013'
>>> os.listdir()
['arch.zip']
>>>
>>> zf = zipfile.ZipFile("arch.zip")
>>> zf.namelist()
['file0.txt', 'file:1.txt']
>>> zf.extractall()
>>> zf.close()
>>>
>>> os.listdir()
['arch.zip', 'file0.txt', 'file_1.txt']
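
To see in advance which member names the substitution would affect, and whether two members would end up with the same name after it, a small preview helper can be sketched. This is only an illustration under stated assumptions: preview_renames, WIN_ILLEGAL and _TABLE are hypothetical names, and the character set is copied from the documentation quote above rather than read from zipfile itself.

```python
import collections
import zipfile

# Assumption: this set mirrors the characters listed in the documentation quote.
WIN_ILLEGAL = ':<>|"?*'
_TABLE = str.maketrans(WIN_ILLEGAL, "_" * len(WIN_ILLEGAL))


def preview_renames(zip_path):
    """Return ({old_name: new_name}, [new names shared by several members])."""
    with zipfile.ZipFile(zip_path) as zf:
        names = zf.namelist()
    new_names = [name.translate(_TABLE) for name in names]
    # Only members whose name actually changes are reported.
    renames = {old: new for old, new in zip(names, new_names) if old != new}
    # A name that appears more than once would be overwritten on extraction.
    counts = collections.Counter(new_names)
    collisions = sorted(name for name, count in counts.items() if count > 1)
    return renames, collisions
```

A non-empty collisions list means that extracting the archive would silently overwrite one file with another, so it may be worth checking before calling extractall().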

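A quick browse of TarFile (source and doc) didn't reveal anything similar (and I wouldn't be very surprised if there is nothing, since the .tar format is mostly used on Nix), so you have to do it manually. Things weren't as simple as I expected, because TarFile doesn't offer the possibility of extracting a member under a different name, as ZipFile does.
Anyway, here's a piece of code (which had ZipFile and TarFile as sources of inspiration):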

code00.py:

#!/usr/bin/env python

import sys
import os
import tarfile


def unpack_tar(filepath, extractpath=".", compression_flag="*"):
    win_illegal = ':<>|"?*'
    table = str.maketrans(win_illegal, '_' * len(win_illegal))
    with tarfile.open(filepath, "r:" + compression_flag) as tar:
        for member in tar.getmembers():
            #print(member, member.isdir(), member.name, member.path)
            #print(type(member))
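            # Build the sanitized target path under extractpath for every member
            target = os.path.join(extractpath, member.path.translate(table))
            if member.isdir():
                os.makedirs(target, exist_ok=True)
            elif member.isfile():
                # The archive may lack an explicit entry for the parent directory
                os.makedirs(os.path.dirname(target) or ".", exist_ok=True)
                with tar.extractfile(member) as fin, open(target, "wb") as fout:
                    fout.write(fin.read())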


def main(*argv):
    unpack_tar("arch00.tar")


if __name__ == "__main__":
    print("Python {:s} {:03d}bit on {:s}\n".format(" ".join(elem.strip() for elem in sys.version.split("\n")),
                                                   64 if sys.maxsize > 0x100000000 else 32, sys.platform))
    rc = main(*sys.argv[1:])
    print("\nDone.")
    sys.exit(rc)

Note that the code above works for simple .tar files (with simple members, including directories).
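
Another option, offered here only as a hedged sketch and not as part of the answer's code, is to rename each TarInfo object before passing it to extractall() through its members argument, so that tarfile itself creates the directories and writes the data. The names extract_tar_sanitized, _renamed_members, WIN_ILLEGAL and _TABLE are hypothetical. The sketch assumes a simple archive: link targets (linkname) are not rewritten, and two members that map to the same sanitized name overwrite each other.

```python
import tarfile

# Assumption: same character set that ZipFile replaces on Windows.
WIN_ILLEGAL = ':<>|"?*'
_TABLE = str.maketrans(WIN_ILLEGAL, "_" * len(WIN_ILLEGAL))


def _renamed_members(tar):
    """Yield every member with Windows-illegal characters replaced in its name."""
    for member in tar.getmembers():
        member.name = member.name.translate(_TABLE)
        yield member


def extract_tar_sanitized(filepath, extractpath="."):
    # "r:*" lets tarfile detect the compression (none, gz, bz2, xz) by itself.
    with tarfile.open(filepath, "r:*") as tar:
        # The 'filter' argument only exists on newer (or patched) Python versions.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        tar.extractall(path=extractpath, members=_renamed_members(tar), **kwargs)
```

For the example in the question, extract_tar_sanitized("file_with_files.tar", "file_with_files") would be the call to try; the design choice is to leave file creation to tarfile and change only the names it sees.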

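Submitted [Python.Bugs]: tarfile: handling Windows (path) illegal characters in archive member names.
I don't know what its outcome will be, since I have submitted a couple of issues that are more serious from my PoV (along with fixes for them), but they were rejected for various reasons.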

