Read file with Python without knowing encoding

Question

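I want to read all files from a folder (using os.walk) and convert them to a single encoding (UTF-8). The problem is that the files do not all have the same encoding: they can be UTF-8, UTF-8 with BOM, or UTF-16.

Is there any way to read these files without knowing their encoding?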

Answer 1

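If it really is always one of these three, it is easy. If the file can be read as UTF-8, it is probably UTF-8; otherwise it is UTF-16. Python can also discard the BOM automatically if one is present (the 'utf-8-sig' codec does this).

You can use a try ... except block to try both: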

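# tryToConvertMyFile stands for your own conversion routine.
# 'from' is a reserved word in Python, so the paths are named src and dst.
try:
    tryToConvertMyFile(src, dst, 'utf-8-sig')
except UnicodeDecodeError:
    tryToConvertMyFile(src, dst, 'utf-16')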
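
As a rough sketch of how the fallback above could be wired into an os.walk loop: the function names decode_guessing and convert_tree_to_utf8 are made up for illustration, and the code assumes every file really is UTF-8 (with or without BOM) or UTF-16 with a BOM. A UTF-16 file without a BOM may decode "successfully" as UTF-8 and come out full of NUL characters, so treat this as a starting point rather than a robust tool.

```python
import os


def decode_guessing(raw):
    """Decode bytes assumed to be UTF-8 (BOM optional) or UTF-16."""
    try:
        # 'utf-8-sig' drops a leading UTF-8 BOM if there is one.
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        # The 'utf-16' codec reads the BOM to pick the byte order.
        return raw.decode("utf-16")


def convert_tree_to_utf8(root):
    """Rewrite every file under root in place as UTF-8 without a BOM."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                text = decode_guessing(f.read())
            # newline="" keeps the original line endings untouched.
            with open(path, "w", encoding="utf-8", newline="") as f:
                f.write(text)
```

Since the files are overwritten in place, it is probably wise to run this on a copy of the folder first.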

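If other encodings may also be present (such as ISO-8859-1), then forget it: there is no 100% reliable way to determine the encoding. But you can guess; see, for example, "Is there a Python library function which attempts to guess the character-encoding of some bytes?"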

Answer 2

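You can read the files in binary mode. There is also the chardet module: with it you can detect a file's encoding and then decode the data you read. The module does have limitations, though.

For example: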

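from chardet import detect

with open('your_file.txt', 'rb') as ef:
    raw = ef.read()  # raw bytes, not decoded text

result = detect(raw)  # a dict with the guessed 'encoding' and a 'confidence'
print(result)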
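
To go from the detection result to actual text, something like the following might work. This is only a sketch: decode_with_guess and its fallback parameter are names invented here, chardet is a third-party package (pip install chardet), and the guess can simply be wrong, especially for short inputs.

```python
try:
    import chardet  # third-party package
except ImportError:
    chardet = None  # the sketch then always uses the fallback encoding


def decode_with_guess(raw, fallback="utf-8"):
    """Decode bytes with chardet's guess, falling back if there is none."""
    guess = chardet.detect(raw) if chardet is not None else {}
    # detect() reports encoding None when it cannot decide (e.g. empty input).
    encoding = guess.get("encoding") or fallback
    try:
        return raw.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # The guess was wrong, or it names a codec Python does not know.
        return raw.decode(fallback, errors="replace")
```

The errors="replace" fallback trades correctness for not crashing; if silent replacement characters are unacceptable, re-raising the exception is the safer choice.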
