fileinput.hook_compressed gives me strings sometimes, bytes other times

Question

I am trying to read lines from multiple files. Some are gzipped and others are plain text files. In Python 2.7 I have been using the following code, and it worked:

for line in fileinput.input(filenames, openhook=fileinput.hook_compressed):
    match = REGEX.match(line)
    if (match):
        # do things with line...

Now that I have moved to Python 3.8, it still works for plain text files, but when it hits a gzipped file I get the following error:

TypeError: cannot use a string pattern on a bytes-like object

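What is the best way to fix this? I know I can check whether line is a bytes object and decode it to a string, but I would rather use some flag that automatically always iterates lines as strings, if possible. I would also prefer to write code that works on both Python 2 and 3.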

Answer 1

fileinput.input does fundamentally different things depending on whether it gets a gzipped file or not. For text files, it opens with regular open, which effectively opens in text mode by default. For gzip.open, the default mode is binary, which is sensible for compressed files of unknown content.
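The difference is easy to reproduce outside of fileinput. The sketch below is only an illustration: it writes the same line to a plain file and to a gzipped file (the file names are made up) and reads each back with the opener's default mode.

```python
import gzip
import os
import tempfile

# Hypothetical sample files, created in a throwaway directory.
tmpdir = tempfile.mkdtemp()
plain_path = os.path.join(tmpdir, "sample.txt")
gz_path = os.path.join(tmpdir, "sample.txt.gz")

with open(plain_path, "w") as f:
    f.write("hello\n")
with gzip.open(gz_path, "wt") as f:
    f.write("hello\n")

# Built-in open() defaults to text mode, so the line is a str.
with open(plain_path) as f:
    plain_line = f.readline()

# gzip.open() defaults to binary mode ('rb'), so the line is bytes.
with gzip.open(gz_path) as f:
    gz_bytes_line = f.readline()

# Asking gzip.open() for text mode explicitly yields str again.
with gzip.open(gz_path, "rt") as f:
    gz_text_line = f.readline()

print(type(plain_line).__name__,
      type(gz_bytes_line).__name__,
      type(gz_text_line).__name__)
```

A regex compiled from a str pattern will accept the first and third values but raise the TypeError from the question on the second.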

The binary-only restriction is artificially imposed by fileinput.FileInput. From the code of the __init__ method:

# restrict mode argument to reading modes
if mode not in ('r', 'rU', 'U', 'rb'):
    raise ValueError("FileInput opening mode must be one of "
                     "'r', 'rU', 'U' and 'rb'")
if 'U' in mode:
    import warnings
    warnings.warn("'U' mode is deprecated",
                  DeprecationWarning, 2)
self._mode = mode
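In practice, this check means that a text mode such as 'rt' is rejected before any file is opened. A minimal sketch, assuming a placeholder file name that is never actually read because the check runs in the constructor:

```python
import fileinput

try:
    # 'rt' is not in the list of accepted modes, so __init__ raises.
    fileinput.FileInput(files=["example.gz"], mode="rt")
except ValueError as exc:
    rejected = True
    message = str(exc)
else:
    rejected = False
    message = ""

print(rejected, message)
```

The exact wording of the message differs between Python versions, so it is safer to rely on the exception type than on the text.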

This gives you two options for a workaround.

Option 1

Set the _mode attribute after __init__. To avoid adding extra lines of code in your usage, you can subclass fileinput.FileInput and use the class directly:

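import fileinput

class TextFileInput(fileinput.FileInput):
    def __init__(self, *args, **kwargs):
        # Pull out a text mode such as 'rt', which FileInput.__init__ would reject
        if 'mode' in kwargs and 't' in kwargs['mode']:
            mode = kwargs.pop('mode')
        else:
            mode = ''
        super().__init__(*args, **kwargs)
        # Restore the requested mode so the open hook receives it
        if mode:
            self._mode = mode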

for line in TextFileInput(filenames, openhook=fileinput.hook_compressed, mode='rt'):
    ...

Option 2

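Messing with undocumented leading-underscore attributes is pretty hacky, so you can instead create a custom hook for compressed files. This is actually quite easy, because you can use the code of fileinput.hook_compressed as a template: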

import os

def my_hook_compressed(filename, mode):
    # Request text mode unless the caller explicitly asked for binary
    if 'b' not in mode:
        mode += 't'
    ext = os.path.splitext(filename)[1]
    if ext == '.gz':
        import gzip
        return gzip.open(filename, mode)
    elif ext == '.bz2':
        import bz2
        return bz2.open(filename, mode)
    else:
        return open(filename, mode)
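A quick end-to-end check of this approach might look like the sketch below. It is not part of the answer: text_hook is a trimmed-down, hypothetical variant of the hook that handles only .gz, and the sample files and the REGEX pattern are placeholders.

```python
import fileinput
import gzip
import os
import re
import tempfile

def text_hook(filename, mode):
    # Trimmed-down variant of the custom hook: only .gz is handled here.
    if 'b' not in mode:
        mode += 't'
    if os.path.splitext(filename)[1] == '.gz':
        return gzip.open(filename, mode)
    return open(filename, mode)

# Hypothetical sample files in a throwaway directory.
tmpdir = tempfile.mkdtemp()
plain_path = os.path.join(tmpdir, 'a.log')
gz_path = os.path.join(tmpdir, 'b.log.gz')
with open(plain_path, 'w') as f:
    f.write('error: plain\nok\n')
with gzip.open(gz_path, 'wt') as f:
    f.write('error: zipped\nok\n')

REGEX = re.compile(r'error: (\w+)')  # placeholder for the question's REGEX

matches = []
with fileinput.input([plain_path, gz_path], openhook=text_hook) as lines:
    for line in lines:
        # Every line is a str here, so a str pattern works for both files.
        match = REGEX.match(line)
        if match:
            matches.append(match.group(1))

print(matches)
```

Because the hook opens both kinds of file in text mode, the loop body from the question needs no changes.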

Option 3

Finally, you can always decode the bytes to unicode strings yourself. This is clearly not the preferable option.
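For completeness, the decoding fallback can be kept out of the loop body with a small generator. This is only a sketch: as_text is a hypothetical helper name, UTF-8 is an assumed encoding, and the sample files are made up.

```python
import fileinput
import gzip
import os
import tempfile

def as_text(lines, encoding='utf-8'):
    """Yield every line as str, decoding the ones that arrive as bytes."""
    for line in lines:
        yield line.decode(encoding) if isinstance(line, bytes) else line

# Hypothetical sample files in a throwaway directory.
tmpdir = tempfile.mkdtemp()
plain_path = os.path.join(tmpdir, 'a.txt')
gz_path = os.path.join(tmpdir, 'b.txt.gz')
with open(plain_path, 'w') as f:
    f.write('alpha\n')
with gzip.open(gz_path, 'wb') as f:
    f.write(b'beta\n')

with fileinput.input([plain_path, gz_path],
                     openhook=fileinput.hook_compressed) as lines:
    decoded = list(as_text(lines))

print(decoded)
```

The isinstance check leaves lines that are already str untouched, so the same wrapper is harmless when the hook returns text.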

Answer 2

Extending Mad Physicist's answer to include the xz and zst extensions.

import os

def my_hook_compressed(filename, mode):
    """hook for fileinput so we can also handle compressed files seamlessly"""
    if 'b' not in mode:
        mode += 't'
    ext = os.path.splitext(filename)[1]
    if ext == '.gz':
        import gzip
        return gzip.open(filename, mode)
    elif ext == '.bz2':
        import bz2
        return bz2.open(filename, mode)
    elif ext == '.xz':
        import lzma
        return lzma.open(filename, mode)
    elif ext == '.zst':
        # zstandard is a third-party package; this branch always returns text
        import zstandard, io
        compressed = open(filename, 'rb')
        decompressor = zstandard.ZstdDecompressor()
        stream_reader = decompressor.stream_reader(compressed)
        return io.TextIOWrapper(stream_reader)
    else:
        return open(filename, mode)

I have not tested this on 2.7, but it works on 3.8+:

for line in fileinput.input(filenames, openhook=my_hook_compressed):
    ...
