Difficulty with dealing with Unicode from sys.stdin

Question

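This is driving me a little crazy. From my research over the past few days it is clear that Unicode is a complicated topic. But here is a behaviour I do not know how to address.

If I read a file containing non-ASCII characters from disk and write it back to a file, everything works as planned. However, when I read the same file from sys.stdin, it does not work and the non-ASCII characters are not encoded properly. The sample code is here: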

# -*- coding: utf-8 -*-
import sys

with open("testinput.txt", "r") as ifile:
    lines = ifile.read()

with open("testout1.txt", "w") as ofile:
    for line in lines:
        ofile.write(line)

with open("testout2.txt", "w") as ofile:
    for line in sys.stdin:
        ofile.write(line)

The input file testinput.txt looks like this:

を
Sōten_Kōro

When I run the script from the command line as cat testinput.txt | python test.py, I get the following outputs respectively:

testout1.txt:

を Sōten_Kōro

testout2.txt:

??? S??ten_K??ro

Any ideas on how to solve this would be a great help. Thanks. Paul.

Answer 1

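The reason is that you took a shortcut that you should never take.

You should always define the encoding. So when you read the file, you should specify that you are reading UTF-8, or whatever the encoding is. Or just make it explicit that you are reading a binary file.

In your case, the Python interpreter uses UTF-8 as the standard encoding when reading from the file, because this is the default on Linux and macOS.

But when you read from standard input, the default is defined by the locale encoding or by an environment variable.

See "How to change the stdin encoding on python" for how to solve it. This answer only explains the cause.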
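As a minimal sketch of what "defining the encoding" could look like for files, the snippet below uses io.open, which accepts an encoding argument on both Python 2.7 and Python 3. The helper name copy_text and its parameters are hypothetical and only serve the illustration.

```python
# -*- coding: utf-8 -*-
import io


def copy_text(src_path, dst_path, encoding="utf-8"):
    """Copy a text file, decoding and encoding with an explicit codec."""
    # io.open decodes the bytes on disk into unicode text using `encoding`,
    # so the result does not depend on the platform default.
    with io.open(src_path, "r", encoding=encoding) as src:
        text = src.read()
    # The same codec turns the text back into bytes when writing.
    with io.open(dst_path, "w", encoding=encoding) as dst:
        dst.write(text)
    return text
```

Calling copy_text("testinput.txt", "testout1.txt") would then read and write UTF-8 regardless of the locale the script happens to run under.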
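To see which defaults are in play on a given machine, something like the following could be run once directly and once with input piped in. This is only a diagnostic sketch; the function name describe_encodings is made up for this example, while sys.stdin.encoding, locale.getpreferredencoding and the PYTHONIOENCODING environment variable are the standard pieces it inspects.

```python
import locale
import os
import sys


def describe_encodings():
    """Collect the settings that influence how text streams are decoded."""
    return {
        # Encoding Python attached to stdin; may be None when input is piped
        # (notably on Python 2).
        "stdin": getattr(sys.stdin, "encoding", None),
        # Encoding derived from the current locale settings.
        "locale": locale.getpreferredencoding(False),
        # Optional override that the interpreter reads at startup.
        "PYTHONIOENCODING": os.environ.get("PYTHONIOENCODING"),
    }


if __name__ == "__main__":
    for name, value in sorted(describe_encodings().items()):
        print("{0}: {1}".format(name, value))
```

If the reported stdin encoding is not UTF-8 while the piped data is, that mismatch would be consistent with the garbled testout2.txt.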

Answer 2

Thanks for the pointer. Based on @GiacomoCatenazzi's answer and reference, I have arrived at the following implementation:

# -*- coding: utf-8 -*-
import sys
import codecs

with open("testinput.txt", "r") as ifile:
    lines = ifile.read()

with open("testout1.txt", "w") as ofile:
    for line in lines:
        ofile.write(line)

UTF8Reader = codecs.getreader('utf-8')
sys.stdin = UTF8Reader(sys.stdin)
with open("testout2.txt", "w") as ofile:
    for line in sys.stdin:
        ofile.write(line.encode('utf-8'))

But I am not sure why I need to encode again after using codecs.getreader?

Paul
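On the follow-up question, a likely explanation (assuming Python 2.7, as the question's tags suggest) is that codecs.getreader('utf-8') wraps a byte stream and yields decoded unicode text, whereas a file opened with the built-in open stores byte strings, so the text has to be encoded again on the way out. The sketch below illustrates that round trip; the io.BytesIO objects are stand-ins for stdin and for the output file and are assumptions of this demo, not part of the original script.

```python
# -*- coding: utf-8 -*-
import codecs
import io

# Stand-in for the raw bytes arriving on stdin (an assumption for this demo).
raw = u"\u3092\nS\u014dten_K\u014dro\n".encode("utf-8")

# The reader decodes UTF-8 bytes into unicode text, line by line.
reader = codecs.getreader("utf-8")(io.BytesIO(raw))
lines = list(reader)

# Decoded text has to be encoded again before it can be stored as bytes.
# Doing it by hand mirrors line.encode('utf-8') in the script above.
by_hand = b"".join(line.encode("utf-8") for line in lines)

# Alternatively, a text wrapper with an explicit encoding does the encoding.
sink = io.BytesIO()
writer = io.TextIOWrapper(sink, encoding="utf-8", newline="")
for line in lines:
    writer.write(line)
writer.flush()
via_wrapper = sink.getvalue()
```

Both by_hand and via_wrapper end up equal to the original bytes. In the script, opening the output with io.open("testout2.txt", "w", encoding="utf-8") would play the role of the wrapper and make the manual encode call unnecessary.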

Tags: windows, unicode, stdin, python-2.7