Output is formatted incorrectly

Question

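I'm running some code in PowerShell from a book I'm reading to learn Python (3.7), but my output doesn't match what is expected, and I can't see where I've gone wrong.

Here is the code: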

from sys import argv

script, input_file = argv

def print_all(f):
    print(f.read())

def rewind(f):
    f.seek(0)

def print_a_line(line_count, f):
    print(line_count, f.readline())

current_file = open(input_file)

print("First let's print the whole file:\n")

print_all(current_file)

print("Now let's rewind, kind of like a tape.")

rewind(current_file)

print("Let's print three lines:")

current_line = 1
print_a_line(current_line, current_file)

current_line = current_line + 1
print_a_line(current_line, current_file)

current_line = current_line + 1
print_a_line(current_line, current_file)

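The formatting of the output seems to be off.

As you can see, a y has been added at the start of each line, and in the part that is supposed to print one line at a time, it skips the second line.

The file test.txt contains: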

this is line 1
this is line 2
this is line 3

P.S. I know there are more efficient ways to do some of this, but that's not the point.

Answer 1

The first two bytes of your file are 0xFF and 0xFE. This is a "byte order mark" (BOM) that indicates that the file is encoded as 16-bit little-endian Unicode (UTF-16-LE). Take a look at the third row of the table on the Wikipedia page; it shows the same two characters, ÿþ, that you see in your output.

To read the file, pass the argument encoding='UTF-16' in the open call:

current_file = open(input_file, encoding='UTF-16')
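One way to confirm this diagnosis is to look at the raw bytes yourself. The following is a minimal sketch, not part of the original code: it writes a small stand-in file (the temporary path and its contents are assumptions for illustration), checks the first two bytes against the BOM constants in the standard codecs module, and then reads the file back with encoding='utf-16'.

```python
import codecs
import os
import tempfile

# Create a small UTF-16 file with a BOM, standing in for test.txt.
# The path and contents are made up for this illustration.
path = os.path.join(tempfile.mkdtemp(), "test.txt")
with open(path, "w", encoding="utf-16", newline="\n") as f:
    f.write("this is line 1\nthis is line 2\n")

# Open in binary mode so no decoding happens, and read the first two bytes.
with open(path, "rb") as f:
    first_two = f.read(2)

# Python's "utf-16" codec writes the BOM in the machine's native byte order,
# so accept either the little-endian or the big-endian form here.
has_utf16_bom = first_two in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
print(first_two, has_utf16_bom)

# With the right encoding, the BOM is consumed and the text comes back clean.
with open(path, encoding="utf-16") as f:
    first_line = f.readline()
print(repr(first_line))
```

If the first two bytes of your own file are b'\xff\xfe', the file is UTF-16 little-endian with a BOM, which matches what you are seeing.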

Answer 2

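The problem is that you are treating UTF-16-LE data (from a file, a PowerShell pipeline, or something else) as UTF-8, Latin-1, cp1252, or something similar.

The solution is probably something like this: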

current_file = open(input_file, encoding='utf-16')

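More generally, you should know what kind of file you are reading. A UTF-16-with-BOM text file, a UTF-8 text file, and a whatever-my-OEM-code-page-is text file are all different things, and you need to pass the right encoding. Otherwise, you are just asking Python to pick a default and hoping for the best.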
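When you really don't know the encoding ahead of time, checking for a BOM is one cheap heuristic. The sketch below is only an illustration: guess_encoding_from_bom is a hypothetical helper name, not a standard library function, and the "utf-8" fallback is an assumption. A file without a BOM could still be in any encoding, so this does not replace knowing your data.

```python
import codecs

def guess_encoding_from_bom(raw, default="utf-8"):
    """Hypothetical helper: pick a codec name from the leading bytes of a file.

    Returns `default` when no BOM is found, which is only a guess.
    """
    # Check UTF-32 first: the UTF-32-LE BOM begins with the UTF-16-LE BOM bytes.
    if raw.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    return default

# Example inputs built in memory rather than read from disk.
utf16_guess = guess_encoding_from_bom("hi".encode("utf-16"))
utf8_sig_guess = guess_encoding_from_bom("hi".encode("utf-8-sig"))
plain_guess = guess_encoding_from_bom(b"hi")
print(utf16_guess, utf8_sig_guess, plain_guess)
```

In the script from the question, you could read the first few bytes with open(input_file, 'rb') and pass the result as the encoding argument of the real open call.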

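To understand why this happens:

You only have plain English characters, all of which can be encoded in ASCII.

In UTF-16, each of these characters takes two bytes. One byte is the same as the character's ASCII value, and the other is 0.

In UTF-8, Latin-1, or any other ASCII-compatible encoding, each of these characters takes one byte, the same byte as in ASCII.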

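So, if you try to read UTF-16 as if it were UTF-8 or Latin-1, every even byte (counting from zero) is the character you want, and every odd byte is 0, which represents the NUL character. Depending on how you print things, that NUL character may be invisible, may print as a space, or may even truncate the string.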
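The interleaving described above is easy to see in memory. The sketch below encodes a short string as UTF-16-LE and decodes it as Latin-1. The second half is an assumption about your file: if it uses Windows-style \r\n line endings (not stated in the question), the same misreading would also explain the skipped line, because the \r and the \n end up separated by a NUL and each one is treated as its own line ending.

```python
import io

# Part 1: each ASCII character becomes its ASCII byte followed by a 0 byte.
raw = "this".encode("utf-16-le")
misread = raw.decode("latin-1")
print(raw)            # the letters with \x00 after each one
print(repr(misread))  # same thing, now as a str containing NUL characters
print(raw[0::2], raw[1::2])  # even-indexed bytes vs. odd-indexed bytes

# Part 2 (assumes CRLF line endings): read UTF-16-LE bytes as Latin-1 text.
data = "line 1\r\nline 2\r\n".encode("utf-16-le")
wrapper = io.TextIOWrapper(io.BytesIO(data), encoding="latin-1")
lines = [wrapper.readline() for _ in range(3)]
for number, line in enumerate(lines, start=1):
    print(number, repr(line))
```

The second readline call returns only a NUL and a newline, so the real second line does not show up until the third call.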

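The two extra characters at the start are the two bytes of the BOM (which is how you are supposed to tell UTF-16-LE from UTF-16-BE) read as Latin-1 characters. The BOM is a special character, U+FEFF, which is encoded as the two bytes \xFF and \xFE in UTF-16-LE, and as \xFE and \xFF in UTF-16-BE. But those same bytes, in Latin-1, are the y-with-umlaut and thorn characters you are seeing.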
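The BOM behavior can be checked directly. This is a small sketch using only built-in codecs; the variable names are chosen for illustration.

```python
import codecs

# U+FEFF encoded in each byte order.
bom_le = "\ufeff".encode("utf-16-le")
bom_be = "\ufeff".encode("utf-16-be")
print(bom_le, bom_be)

# The little-endian BOM bytes, misread as Latin-1, become two visible characters.
as_latin1 = bom_le.decode("latin-1")
print(as_latin1)

# The generic "utf-16" codec uses the BOM to pick the byte order, then drops it.
decoded = (codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")).decode("utf-16")
print(repr(decoded))
```

This is why encoding='utf-16' is usually the convenient choice for files that start with a BOM: the codec detects the byte order and removes the BOM for you, whereas 'utf-16-le' would leave U+FEFF at the start of the text.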


python

python-3.7