Python 2 assumes different source code encodings

Question

I noticed that, without a source code encoding declaration, the Python 2 interpreter assumes the source code is encoded in ASCII for scripts and standard input:

$ python test.py  # where test.py holds the line: print u'é'
  File "test.py", line 1
SyntaxError: Non-ASCII character '\xc3' in file test.py on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

$ echo "print u'é'" | python
  File "/dev/fd/63", line 1
SyntaxError: Non-ASCII character '\xc3' in file /dev/fd/63 on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

and that it is encoded in ISO-8859-1 with the -m module and -c command flags:

$ python -m test  # where test.py holds the line: print u'é'
Ã©

$ python -c "print u'é'"
Ã©

Where is this documented?

Contrast this with Python 3, which always assumes the source code is encoded in UTF-8 and therefore prints é in all four cases.

Note: I tested this with CPython 2.7.14 on both macOS 10.13 and Ubuntu Linux 17.10, with the console encoding set to UTF-8.

Answer 1

The -c and -m switches ultimately (*) run the supplied code with the exec statement or the compile() function, both of which take Latin-1 source code:

The first expression should evaluate to either a Unicode string, a Latin-1 encoded string, an open file object, a code object, or a tuple.

This is not documented; it is an implementation detail, which may or may not be considered a bug.
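The Ã© output in the question is consistent with a Latin-1 reading of UTF-8 bytes. The following is only a sketch of that effect, written so that it also runs on Python 3; the variable names are mine, and it emulates the decoding steps rather than calling anything in the interpreter's startup code:

```python
# -*- coding: utf-8 -*-
# A UTF-8 terminal hands 'é' to the interpreter as the two bytes C3 A9.
utf8_bytes = u"é".encode("utf-8")

# Treating the source as Latin-1 maps each byte to one character,
# so the single 'é' becomes the two characters U+00C3 and U+00A9.
misread = utf8_bytes.decode("latin-1")

# Printing to a UTF-8 console then encodes both characters separately.
printed = misread.encode("utf-8")

print(repr(utf8_bytes))
print(repr(misread))
print(repr(printed))
```

The two characters U+00C3 and U+00A9 are exactly what a UTF-8 console displays as Ã©.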

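I don't think this is worth fixing, however, and Latin-1 is a superset of ASCII, so little is lost. How code from -c and -m is handled has been cleaned up in Python 3 and is much more consistent there: code passed in with -c is decoded using the current locale, and modules loaded with the -m switch default to UTF-8, as usual.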
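For comparison, here is a small sketch of how compile() treats byte strings on Python 3 (it assumes Python 3 and will not behave the same way on Python 2). Without a declaration the bytes are read as UTF-8; adding a PEP 263 coding line for Latin-1 reproduces, explicitly, the reading that Python 2 applies implicitly. The names default_ns and latin1_ns are just illustrative:

```python
# -*- coding: utf-8 -*-
# Source bytes as a UTF-8 terminal or editor would produce them.
source = u"value = '\u00e9'\n".encode("utf-8")

# No declaration: Python 3 decodes the bytes as UTF-8.
default_ns = {}
exec(compile(source, "<demo>", "exec"), default_ns)

# Explicit Latin-1 declaration: every byte becomes its own character.
latin1_source = b"# -*- coding: latin-1 -*-\n" + source
latin1_ns = {}
exec(compile(latin1_source, "<demo>", "exec"), latin1_ns)

print(repr(default_ns["value"]))
print(repr(latin1_ns["value"]))
```

The same coding line at the top of test.py is also the usual way to make the script and standard-input cases from the question accept non-ASCII source on Python 2.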

(*) If you want to know the exact implementations used, start at the Py_Main() function in Modules/main.c, which handles both -c and -m as:

if (command) {
    sts = PyRun_SimpleStringFlags(command, &cf) != 0;
    free(command);
} else if (module) {
    sts = RunModule(module, 1);
    free(module);
}

-c is executed via the PyRun_SimpleStringFlags() function, which in turn calls PyRun_StringFlags(). When you use exec, a bytestring object is passed to PyRun_StringFlags() too, and the source code is then assumed to contain Latin-1-encoded bytes.
-m uses the RunModule() function to pass the module name to the private function _run_module_as_main() in the runpy module, which uses pkgutil.get_loader() to load the module metadata and fetches the module code object with the loader.get_code() function on the PEP 302 loader; if no cached bytecode is available, the code object is produced by using the compile() function with the mode set to exec.
