How to prevent "UnicodeDecodeError" when reading piped input from sys.stdin?

Question

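I am writing a Python 3 script that reads input which is mostly hex. However, the system is set to use UTF-8, and when piping from a Bash shell into the script, I keep getting the following UnicodeDecodeError: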

UnicodeDecodeError: ('utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte)

Following other SO answers, I am using sys.stdin.read() in Python 3 to read the piped input, like this:

import sys
...
isPipe = 0
if not sys.stdin.isatty() :
    isPipe = 1
    try:
        inpipe = sys.stdin.read().strip()
    except UnicodeDecodeError as e:
        err_unicode(e)
...

This works when piping like this:

# echo "\xed\xff\xff\x0b\x04\x00\xa0\xe1" | some.py
<output all ok!>

However, using the raw format does not:

# echo -en "\xed\xff\xff\x0b\x04\x00\xa0\xe1"

    ▒▒▒
   ▒▒

# echo -en "\xed\xff\xff\x0b\x04\x00\xa0\xe1" | some.py
UnicodeDecodeError: ('utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte)

I also tried other promising SO answers:

# echo -en "\xed\xff\xff\x0b\x04\x00\xa0\xe1" | python3 -c "open(1,'w').write(open(0).read())"
# echo -en "\xed\xff\xff\x0b\x04\x00\xa0\xe1" | python3 -c "from io import open; open(1,'w').write(open(0).read())"

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte

As far as I understand it, when your terminal encounters the start of a UTF-8 sequence, it expects it to be followed by 1-3 more bytes, like this:

UTF-8 is a variable width character encoding capable of encoding all valid code points in Unicode using one to four 8-bit bytes. So the leading byte of a multi-byte character (in the range 0xC2 - 0xF4) must be followed by 1-3 continuation bytes, in the range 0x80 - 0xBF.
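To make the rule above concrete, here is a minimal sketch that labels each byte of the failing input by the role it could play in UTF-8. The helper name classify_utf8_byte is hypothetical, and the classification is deliberately simplified: it looks at each byte in isolation and ignores the extra restrictions on overlong and surrogate sequences.

```python
def classify_utf8_byte(b: int) -> str:
    """Classify a single byte by the role it could play in UTF-8 (simplified)."""
    if b <= 0x7F:
        return "ascii"
    if 0x80 <= b <= 0xBF:
        return "continuation"
    if 0xC2 <= b <= 0xDF:
        return "lead-2"  # starts a 2-byte sequence
    if 0xE0 <= b <= 0xEF:
        return "lead-3"  # starts a 3-byte sequence
    if 0xF0 <= b <= 0xF4:
        return "lead-4"  # starts a 4-byte sequence
    return "invalid"     # 0xC0, 0xC1 and 0xF5-0xFF never occur in valid UTF-8


# The bytes produced by the echo -en command in the question
data = b"\xed\xff\xff\x0b\x04\x00\xa0\xe1"
roles = [classify_utf8_byte(b) for b in data]

for byte, role in zip(data, roles):
    print(f"0x{byte:02x} {role}")
```

The first byte, 0xed, announces a 3-byte sequence, but the next byte, 0xff, is not a continuation byte, which matches the "invalid continuation byte" message at position 0.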

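However, I can't always be sure where my input stream comes from, and it may well be raw data rather than the ASCII hex version above. So I need to handle this raw input somehow.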

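I have looked at some alternatives, such as:

using codecs.decode
using open("myfile.jpg", "rb", buffering=0) with raw I/O
using bytes:

bytes.decode(encoding="utf-8", errors="ignore")

or simply using open(...)

But I don't know whether or how they could read the piped input stream the way I want.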
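As a rough illustration of what the errors argument of bytes.decode does once the data is already available as bytes, the sketch below decodes the sample bytes from the question with a few standard error handlers. It does not read from a pipe; the variable names are only for illustration.

```python
# The raw bytes from the question
data = b"\xed\xff\xff\x0b\x04\x00\xa0\xe1"

# "ignore" silently drops every byte that cannot be decoded (lossy)
ignored = data.decode("utf-8", errors="ignore")

# "replace" substitutes U+FFFD for undecodable bytes (lossy)
replaced = data.decode("utf-8", errors="replace")

# "surrogateescape" maps undecodable bytes to lone surrogates, so the
# original bytes can be recovered by encoding with the same handler
escaped = data.decode("utf-8", errors="surrogateescape")
round_trip = escaped.encode("utf-8", errors="surrogateescape")

# latin-1 maps every byte to exactly one code point, so it never fails
latin1 = data.decode("latin-1")

print(repr(ignored))
print(repr(replaced))
print(round_trip == data)  # True
```

Only surrogateescape and latin-1 preserve the original bytes, so for arbitrary binary data it is usually simpler to keep working with bytes and decode only when the content is known to be text.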

How can I make my script handle raw byte streams as well?

PS. Yes, I have read a great many similar SO questions, but none of them adequately deals with this UTF-8 input error. The best one is this one.

This is not a duplicate.

Answer 1

Here is a hacky way to read stdin in binary mode, as if it were a file:

import sys

# Open the stdin file descriptor in binary mode; closefd=False keeps it open afterwards
with open(sys.stdin.fileno(), mode='rb', closefd=False) as stdin_binary:
    raw_input = stdin_binary.read()
try:
    # text is the string formed by decoding raw_input as UTF-8
    text = raw_input.decode('utf-8')
except UnicodeDecodeError:
    # raw_input is not valid UTF-8, do something else with it
    text = None

Answer 2

I finally solved this by not using sys.stdin!

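Instead, I used with open(0, 'rb'), where:

0 is the file descriptor for stdin.
'rb' opens it for reading in binary mode.

This seems to avoid the problem of the system trying to interpret the piped bytes as characters in the locale's encoding. I got the idea after seeing that the following works and returns the correct (unprintable) characters: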

echo -en "\xed\xff\xff\x0b\x04\x00\xa0\xe1" | python3 -c "with open(0, 'rb') as f: x=f.read(); import sys; sys.stdout.buffer.write(x);"

▒▒▒
   ▒▒

So, to read any piped data correctly, I used:

import sys

if not sys.stdin.isatty():
    try:
        # fd 0 is stdin; binary mode returns bytes without decoding
        with open(0, 'rb') as f:
            inpipe = f.read()
    except Exception as e:
        err_unknown(e)
    # This can't happen in binary mode:
    #except UnicodeDecodeError as e:
    #    err_unicode(e)
...

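This reads your piped data into a Python bytes object.

The next problem is to determine whether the piped data came from a character string (such as echo "BADDATA0") or from a binary stream. The latter can be simulated with echo -ne "\xBA\xDD\xAT\xA0", as shown in the OP. In my case, I just use a regex to look for anything that is not an ASCII hex digit or a space.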

import re

if inpipe:
    # Match any run of bytes that is not a hex digit or a space
    rx = re.compile(b'[^0-9a-fA-F ]+')
    r = rx.findall(inpipe.strip())
    if r == []:
        print("is probably a HEX ASCII string")
    else:
        print("is something else, possibly binary")

Of course this could be done better and smarter. (Comments welcome!)
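One possible refinement of the check above, as a sketch only: try to parse the input as ASCII hex with bytes.fromhex and fall back to treating it as raw binary when that fails. The function name parse_piped_input and the "hex" / "binary" / "empty" labels are hypothetical, and stripping literal \x prefixes is an assumption about how the hex text is written.

```python
def parse_piped_input(raw: bytes):
    """Return (kind, payload); kind is 'hex', 'binary' or 'empty' (hypothetical labels)."""
    stripped = raw.strip()
    if not stripped:
        return "empty", b""
    try:
        # Assumption: hex text may carry literal "\x" prefixes, e.g. \xed\xff
        text = stripped.decode("ascii").replace("\\x", "")
        # bytes.fromhex raises ValueError on non-hex characters or odd length
        return "hex", bytes.fromhex(text)
    except ValueError:  # UnicodeDecodeError is a subclass of ValueError
        return "binary", raw


print(parse_piped_input(b"ed ff ff 0b\n"))
print(parse_piped_input(b"\\xed\\xff\\xff\\x0b\n"))
print(parse_piped_input(b"\xed\xff\xff\x0b\x04\x00\xa0\xe1")[0])
```

With this approach both spellings of the hex text yield the same payload bytes, while anything that is not clean hex is passed through unchanged. Short binary inputs that happen to consist only of hex digits would still be misclassified, so this remains a heuristic.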

Addendum (from the Python documentation for open()):

mode is an optional string that specifies the mode in which the file is opened. It defaults to r which means open for reading in text mode. In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding. (For reading and writing raw bytes use binary mode and leave encoding unspecified.) The default mode is 'r' (open for reading text, synonym of 'rt'). For binary read-write access, the mode w+b opens and truncates the file to 0 bytes. r+b opens the file without truncation.

... Python distinguishes between binary and text I/O. Files opened in binary mode (including b in the mode argument) return contents as bytes objects without any decoding. In text mode (the default, or when t is included in the mode argument), the contents of the file are returned as str, the bytes having been first decoded using a platform-dependent encoding or using the specified encoding if given.

If closefd is False and a file descriptor rather than a filename was given, the underlying file descriptor will be kept open when the file is closed. If a filename is given, closefd must be True (the default) otherwise an error will be raised.
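The closefd behaviour quoted above can be checked with a small experiment. This sketch uses os.pipe() as a stand-in for stdin so that it does not disturb the real file descriptor 0; the helper fd_is_open is hypothetical.

```python
import os


def fd_is_open(fd: int) -> bool:
    """Return True if fd still refers to an open file descriptor."""
    try:
        os.fstat(fd)
        return True
    except OSError:
        return False


# Case 1: closefd=False leaves the descriptor open after the with block
r1, w1 = os.pipe()
os.write(w1, b"\xed\xff\xff\x0b")
os.close(w1)
with open(r1, "rb", closefd=False) as f:
    data1 = f.read()
still_open = fd_is_open(r1)
os.close(r1)  # we are responsible for closing it ourselves

# Case 2: the default closefd=True closes the descriptor with the file object
r2, w2 = os.pipe()
os.write(w2, b"\xed\xff\xff\x0b")
os.close(w2)
with open(r2, "rb") as f:
    data2 = f.read()
closed_now = not fd_is_open(r2)

print(data1, still_open, closed_now)
```

Applied to stdin, this suggests that with open(0, 'rb') closes file descriptor 0 when the block ends, whereas passing closefd=False keeps stdin usable afterwards.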

Answer 3

Use sys.stdin.buffer.raw instead of sys.stdin.
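A minimal sketch of this approach, assuming the standard sys.stdin object (a replacement stream might not have a buffer attribute). sys.stdin.buffer is the buffered binary layer and sys.stdin.buffer.raw is the unbuffered layer beneath it; the sketch uses the buffered one. The function name read_all_bytes is hypothetical, and io.BytesIO stands in for a real pipe so the example runs on its own.

```python
import io
import sys


def read_all_bytes(stream=None) -> bytes:
    """Read everything from a binary stream; defaults to sys.stdin.buffer."""
    if stream is None:
        stream = sys.stdin.buffer  # binary layer underneath the text-mode sys.stdin
    return stream.read()


# Stand-in for piped stdin, carrying the bytes from the question
fake_stdin = io.BytesIO(b"\xed\xff\xff\x0b\x04\x00\xa0\xe1")
payload = read_all_bytes(fake_stdin)
print(payload.hex())  # edffff0b0400a0e1
```

In a real script, calling read_all_bytes() with no argument would read the piped bytes without any decoding step, so no UnicodeDecodeError can be raised while reading.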
