How can I write to a file with the same formatting as print?

Question

TL;DR

Trying to write a string to a file raises the following error:

Code

logfile.write(cli_args.last_name)

Output

UnicodeEncodeError: 'ascii' codec can't encode characters in position 8-9: ordinal not in range(128)

But this works:

Code

print(cli_args.last_name)

Output

Pérez

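Why?

Full context

I built a script that receives data from the Linux CLI, processes it, and finally creates a Zendesk ticket with the provided data. It is sort of a CLI API: in front of my script there is a bigger system with a web interface containing a form where users fill in field values, which are then substituted into the call to the CLI script. For example: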

myscript.py --first_name '_first_name_' --last_name '_last_name_'

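The script ran without any problems until yesterday, when the web interface was updated. I think they changed something related to the charset or encoding.

I do some simple logging with f-strings by opening a file and writing informational messages, so that if something fails I can go back and check where it happened. I also use the argparse module to read the CLI arguments. Example: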

logfile.write(f"\tChecking for opened tickets for user '{cli_args.first_name} {cli_args.last_name}'\n")

After the web update, the following error appeared:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 8-9: ordinal not in range(128)

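After some troubleshooting I found that this happens because some users enter names with accents, such as Carlos Pérez.

I need the script working again and prepared for inputs like this, so I looked for answers by checking the HTTP headers of the web console's input form and found that it uses Content-Type: text/html; charset=UTF-8. My first attempt was to encode the str passed in the CLI argument to utf-8 and decode it again with the same codec, but that did not work.

For the second attempt, I checked the Python documentation for str.encode() and bytes.decode(), and tried this: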

logfile.write(
    "\tChecking for opened tickets for user "
    f"'{cli_args.first_name.encode(encoding='utf-8', errors='ignore').decode('utf-8')} "
    f"{cli_args.last_name.encode(encoding='utf-8', errors='ignore').decode('utf-8')}'"
)

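It ran, but it dropped the accented letter, so Carlos Pérez became Carlos Prez. That is no use to me here, because I need the complete input.

As a desperate move, I tried printing the same f-string I was trying to write to the log file, and to my surprise it worked. It printed Carlos Pérez to the console without any encoding/decoding step.

How does print work? Why does writing to a file fail? And most importantly, how can I write to a file with the same formatting as print?

Edit 1 @MarkTolonen

I tried the following: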

logfile = open("/usr/share/pandora_server/util/plugin/plugin_mcm/sandbox/755bug.txt", mode="a", encoding="utf8")
logfile.write(cli_args.body)
logfile.close()

Output:

Traceback (most recent call last):
  File "/usr/share/pandora_server/util/plugin/plugin_mcm/sandbox/ticket_query_app.py", line 414, in <module>
    main()
  File "/usr/share/pandora_server/util/plugin/plugin_mcm/sandbox/ticket_query_app.py", line 81, in main
    logfile.write(cli_args.body)
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 8-9: surrogates not allowed

Edit 2

I managed to find the text that causes the problem:

if __name__ == "__main__":
    string = (
        "Buenos d\udcc3\udcadas,\r\n\r\n"
        "Mediante  monitoreo autom\udcc3\udca1tico se ha detectado un evento fuera de lo normal:\r\n\r\n"
        "Descripci\udcc3\udcb3n del evento: _snmp_f13_\r\n"
        "Causas sugeridas del evento: _snmp_f14_\r\n"
        "Posible afectaci\udcc3\udcb3n del evento: _snmp_f15_\r\n"
        "Validaciones de bajo impacto: _snmp_f16_\r\n"
        "Fecha y hora del evento: 2021-07-14 17:47:51\r\n\r\n"
        "Saludos."
    )

    # Output: Text with the unicodes translated
    print(string)

    # Output: "UnicodeEncodeError: 'utf-8' codec can't encode characters in position 8-9: surrogates not allowed"
    with open(file="test.log", mode="w", encoding="utf8") as logfile:
        logfile.write(string)

Answer 1

The answer is the encoding parameter of open. Observe:

Last login: Wed Jul 14 15:05:24 2021 from 50.126.68.34
[timrprobocom@jared-ingersoll ~]$ python3
Python 3.6.9 (default, Jan 26 2021, 15:33:00)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> f = open('x.txt','a')
>>> g = open('y.txt','a',encoding='utf-8')
>>> s = "spades \u2660 spades"
>>> f.write(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\u2660' in position 7: ordinal not in range(128)
>>> g.write(s)
15
>>>
[timrprobocom@jared-ingersoll ~]$ hexdump -C y.txt
00000000  73 70 61 64 65 73 20 e2  99 a0 20 73 70 61 64 65  |spades ... spade|
*
00000011
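The session above can also be reproduced as a script. This is only a minimal sketch: the temporary file path is a stand-in for the real log file, and the printed default encoding depends on the locale of the machine that runs it (it was ASCII in the session above, but it may well be UTF-8 elsewhere).

```python
import locale
import os
import sys
import tempfile

# Encoding that open() falls back to when no encoding argument is given.
# It comes from the locale, which is why the same script can behave
# differently depending on how it is launched.
print("open() default:", locale.getpreferredencoding(False))

# stdout has its own encoding and error handler, independent of open().
print("stdout encoding:", getattr(sys.stdout, "encoding", None))
print("stdout errors:", getattr(sys.stdout, "errors", None))

text = "spades \u2660 spades"

# Hypothetical scratch file, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "y.txt")

# Passing encoding explicitly removes the dependency on the locale.
with open(path, "a", encoding="utf-8") as handle:
    written = handle.write(text)  # number of characters, not bytes

with open(path, "rb") as handle:
    raw = handle.read()

print(written, len(raw))
```

write() reports characters written, while the file holds more bytes because the spade symbol takes three bytes in UTF-8.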

Answer 2

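Something upstream appears to be misconfigured. Your string looks like it was produced by a decode operation that used the wrong encoding together with the errors='surrogateescape' error handler. From the data shown, the decode appears to have tried to decode UTF-8-encoded text as ASCII.

errors='surrogateescape' is a way for a codec to handle invalid bytes during a decode operation. When converting to a Unicode string, the error handler replaces each invalid byte with a lone surrogate in the range U+DC80..U+DCFF. The process can be reversed to get the original byte string back by performing an encode with errors='surrogateescape' and the same encoding.

The lone surrogates in your string match the pattern that a decode(encoding='ascii', errors='surrogateescape') call would produce if the data it was given were actually encoded in UTF-8: the surrogates all fall in the range U+DC80..U+DCFF, and the bytes they correspond to form valid UTF-8. In the code below, I recover the original bytes and then decode them correctly as UTF-8. Once the Unicode string is valid, it can be written to the log file using encoding='utf8'.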
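The round trip described above can be sketched on a single name. The byte string here is made up for the illustration (it is not the asker's actual input), and the values are printed with ascii() so the demo does not itself depend on the console encoding.

```python
# Bytes as a UTF-8 aware sender would transmit them: the accented
# letter occupies two bytes, 0xC3 0xA9.
raw = "P\u00e9rez".encode("utf-8")

# Decoding those bytes as ASCII cannot succeed for 0xC3 and 0xA9, so
# surrogateescape maps each one to a lone surrogate U+DC00 + byte value.
mangled = raw.decode("ascii", errors="surrogateescape")
print(ascii(mangled))

# Encoding with the same codec and handler restores the original bytes.
restored = mangled.encode("ascii", errors="surrogateescape")
print(restored)

# The recovered bytes can now be decoded with the correct codec.
fixed = restored.decode("utf-8")
print(ascii(fixed))
```

The two-surrogate pattern in the mangled value has the same shape as the \udcc3\udc.. pairs in the problem text.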

string = (
    "Buenos d\udcc3\udcadas,\r\n\r\n"
    "Mediante  monitoreo autom\udcc3\udca1tico se ha detectado un evento fuera de lo normal:\r\n\r\n"
    "Descripci\udcc3\udcb3n del evento: _snmp_f13_\r\n"
    "Causas sugeridas del evento: _snmp_f14_\r\n"
    "Posible afectaci\udcc3\udcb3n del evento: _snmp_f15_\r\n"
    "Validaciones de bajo impacto: _snmp_f16_\r\n"
    "Fecha y hora del evento: 2021-07-14 17:47:51\r\n\r\n"
    "Saludos."
)

fixed = string.encode('ascii',errors='surrogateescape').decode('utf8')
print(fixed)

with open(file="test.log", mode="w", encoding="utf8") as logfile:
    logfile.write(fixed)

You can read more about surrogate escapes in PEP 383.
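As for why print succeeds on the same string, one plausible explanation (an assumption, since it depends on how the process is started) is that sys.stdout is using the surrogateescape error handler, which Python applies to stdout under the C/POSIX locale. If so, a log file opened with the same handler should receive the same bytes that print sends to the terminal. The sketch below uses a hypothetical temporary path and a shortened sample of the problem text.

```python
import os
import tempfile

# Shortened sample of the problem text: UTF-8 bytes that were decoded
# as ASCII with surrogateescape somewhere upstream.
mangled = "Buenos d\udcc3\udcadas"

# Hypothetical scratch location for the demonstration.
path = os.path.join(tempfile.mkdtemp(), "test.log")

# With errors="surrogateescape", each lone surrogate is written back
# as the single byte it stands for instead of raising an error.
with open(path, mode="w", encoding="utf-8", errors="surrogateescape") as logfile:
    logfile.write(mangled)

# The bytes on disk are ordinary UTF-8.
with open(path, "rb") as logfile:
    raw = logfile.read()

# Reading the file back as UTF-8 yields a valid string.
with open(path, encoding="utf-8") as logfile:
    text = logfile.read()

print(raw)
print(ascii(text))
```

This writes the right bytes without repairing the string in memory, so it only suits text that is passed straight through to the file. Fixing the string first, as in the answer above, is the safer choice when the value is also sent elsewhere.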


python

string

file-io

python-3.x
