How to read a file with Cyrillic characters in the filename in Python 3

Question

I am trying to read an image file whose filename contains Cyrillic characters.

ls /home/atin/test
ОД Д.bmp

Now I am trying to read "ОД Д.bmp" in Python 3:

image_path='/home/atin/test/ОД Д.bmp'
import matplotlib.pyplot as plt
sample_image=plt.imread(image_path)

But I get this error:

SystemError: <built-in function read_png> returned NULL without setting an error

However, os.listdir('/home/atin/test') gives the following output:

['\udcd0\udc9e\udcd0\udc94 \udcd0\udc94.bmp']

How can I decode the output above back into the original filename, ОД Д.bmp? I am using Python 3.6 on Ubuntu.

Answer 1

Your system's locale is misconfigured. On Linux, Python takes the filesystem codec from the locale. From the sys.getfilesystemencoding() documentation:

Return the name of the encoding used to convert between Unicode filenames and bytes filenames.

[...]

On Unix, the encoding is the locale encoding.

Your filesystem stores filenames as UTF-8, but Python is not decoding them as UTF-8.

Because of this, the UTF-8 data cannot be decoded correctly. When decoding fails, the surrogateescape error handler kicks in, and it has 'preserved' the undecodable bytes as low surrogate code points.

You can work around this by encoding to UTF-8 with the same error handler:

>>> '\udcd0\udc9e\udcd0\udc94 \udcd0\udc94.bmp'.encode('utf8', 'surrogateescape')
b'\xd0\x9e\xd0\x94 \xd0\x94.bmp'

The result is exactly the correct UTF-8 encoding of your filename:

>>> '\udcd0\udc9e\udcd0\udc94 \udcd0\udc94.bmp'.encode('utf8', 'surrogateescape').decode('utf8')
'ОД Д.bmp'
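
The two steps above can be wrapped in a small helper. This is only a sketch: recover_filename is a hypothetical name, and it assumes the names on disk really are UTF-8 (if they are not, the final decode raises UnicodeDecodeError).

```python
def recover_filename(name: str, encoding: str = "utf-8") -> str:
    """Undo surrogateescape decoding and re-decode the bytes properly.

    Hypothetical helper; assumes the filename bytes on disk are valid
    in `encoding` (UTF-8 by default).
    """
    # Lone surrogates U+DC80..U+DCFF are turned back into bytes 0x80..0xFF.
    raw = name.encode(encoding, "surrogateescape")
    # Decode the recovered bytes with the codec the filesystem really uses.
    return raw.decode(encoding)


escaped = '\udcd0\udc9e\udcd0\udc94 \udcd0\udc94.bmp'
recovered = recover_filename(escaped)
print(recovered.encode("utf-8"))  # b'\xd0\x9e\xd0\x94 \xd0\x94.bmp'
```

Applied to each entry returned by os.listdir(), this gives readable names, but it only repairs the display; passing the original escaped string back to Python's file functions is still what round-trips to the right bytes.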

You want to set at least LC_CTYPE=en_US.UTF-8 to avoid this problem. In your case, it appears you have LC_CTYPE=UTF-8, which is not valid (you could use LC_CTYPE=en_SG.UTF-8 instead).
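
To see which codec Python actually picked up from the locale, something like the following may help. It is a diagnostic sketch only; the values printed depend entirely on your environment, and sys.getfilesystemencodeerrors() requires Python 3.6 or later.

```python
import locale
import os
import sys

# Codec and error handler Python uses for filenames.
fs_encoding = sys.getfilesystemencoding()
fs_errors = sys.getfilesystemencodeerrors()

# Encoding derived from the locale settings, without changing the locale.
preferred = locale.getpreferredencoding(False)

print("filesystem encoding:", fs_encoding)
print("filesystem error handler:", fs_errors)
print("preferred locale encoding:", preferred)

# Environment variables that commonly determine the locale.
for var in ("LC_ALL", "LC_CTYPE", "LANG"):
    print(var, "=", os.environ.get(var))
```

If the filesystem encoding is reported as ascii or ansi_x3.4-1968 rather than utf-8, the locale is the likely culprit.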

Another workaround is to use a bytes path:

image_path = '/home/atin/test/ОД Д.bmp'.encode('utf8')
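
The bytes-path approach can be tried out without touching the locale. The sketch below uses a temporary directory and a two-byte placeholder file instead of a real image, both of which are assumptions for illustration; it shows that os.listdir() returns bytes names when given a bytes directory, and that open() accepts a bytes path.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Work with bytes throughout so no filename decoding takes place.
    directory = os.fsencode(tmp)
    name = "ОД Д.bmp".encode("utf-8")
    path = os.path.join(directory, name)

    # Placeholder content, not a valid BMP image.
    with open(path, "wb") as f:
        f.write(b"BM")

    # A bytes argument makes os.listdir() return bytes names.
    names = os.listdir(directory)

    # Decode explicitly with the codec the filesystem really uses.
    decoded = [n.decode("utf-8") for n in names]

    with open(path, "rb") as f:
        header = f.read(2)

print(names)   # the raw filename bytes
print(header)  # b'BM'
```

Whether a given library accepts bytes paths has to be checked case by case; the built-in open() and the os functions do.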
