Best and clean way to Encode Emojis (Python) from text file

Question

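Referring to this question:

I am looking for the best way to encode emojis from this form, \ud83d\ude04, into this (Unicode) form, \U0001f604. At the moment I have no idea other than writing a Python method that goes through the text file and replaces the emoji encodings.

This is the string that could be converted:

As an assumption, would the text need to be passed line by line and converted?

Potential idea: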

with open(ff_name, 'rb') as source_file:
  with open(target_file_name, 'w+b') as dest_file:
    contents = source_file.read()
    dest_file.write(contents.decode('utf-16').encode('utf-8'))

Answer 1

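\ud83d\ude04 is the UTF-16 representation (a surrogate pair) of the character SMILING FACE WITH OPEN MOUTH AND SMILING EYES (U+1F604). You need to decode it into the character and then convert the character's code point to a hex string. I don't know Python well enough to tell you how to do that.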
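For illustration, the decoding step described above can be sketched in Python with the standard UTF-16 surrogate arithmetic. The helper name combine_surrogates is hypothetical and not part of any library; this is a minimal sketch, not necessarily how one would handle a whole file.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one Unicode code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    # Each surrogate carries 10 bits of the code point offset above 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


code_point = combine_surrogates(0xD83D, 0xDE04)
print(f"\\U{code_point:08x}")  # \U0001f604
```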

Answer 2

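So, I assume that you somehow got a raw ASCII string that contains escape sequences with UTF-16 code units forming surrogate pairs, and that you (for whatever reason) want to convert it into the \UXXXXXXXX format.

So, from here on I assume that your input (bytes!) looks like this: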

weirdInput = r"hello \ud83d\ude04".encode("latin_1")

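Now you want to do the following:

Interpret the bytes in such a way that the \uXXXX things are converted into UTF-16 code units. There is raw_unicode_escape for that, but unfortunately it needs a separate pass to fix the surrogate pairs (I don't know why, to be honest).
Fix the surrogate pairs, turning the data into valid UTF-16.
Decode it as valid UTF-16.
Encode it again as "raw_unicode_escape".
Decode it back into good old latin_1, consisting only of good old ASCII with Unicode escape sequences in the \UXXXXXXXX format.

Something like this: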

output = (weirdInput
  .decode("raw_unicode_escape")
  .encode('utf-16', 'surrogatepass')
  .decode('utf-16')
  .encode("raw_unicode_escape")
  .decode("latin_1")
)

Now if you print(output), you get:

hello \U0001f604

Note that if you stop at an intermediate stage:

smiley = (weirdInput
  .decode("raw_unicode_escape")
  .encode('utf-16', 'surrogatepass')
  .decode('utf-16')
)

then you get a Unicode string containing the smiley:

print(smiley)
# hello 😄
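To make the need for the separate surrogate pass more concrete, here is a small sketch that inspects the intermediate values. The names raw, step1, step2 and step3 are illustrative only. After the first decode the string still holds two separate surrogate code points, and only the UTF-16 round trip merges them into one character.

```python
raw = rb"hello \ud83d\ude04"  # ASCII bytes containing literal backslash escapes

step1 = raw.decode("raw_unicode_escape")         # str with two lone surrogates
step2 = step1.encode("utf-16", "surrogatepass")  # bytes: BOM plus UTF-16 code units
step3 = step2.decode("utf-16")                   # str: the pair is now one code point

print(len(step1))           # 8: "hello " plus two surrogate code points
print(len(step3))           # 7: "hello " plus one emoji
print(hex(ord(step3[-1])))  # 0x1f604
```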

Full code:

weirdInput = r"hello \ud83d\ude04".encode("latin_1")

output = (weirdInput
  .decode("raw_unicode_escape")
  .encode('utf-16', 'surrogatepass')
  .decode('utf-16')
  .encode("raw_unicode_escape")
  .decode("latin_1")
)


smiley = (weirdInput
  .decode("raw_unicode_escape")
  .encode('utf-16', 'surrogatepass')
  .decode('utf-16')
)

print(output)
# hello \U0001f604

print(smiley)
# hello 😄
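Since the question is about a text file, the same chain could be applied line by line, roughly as sketched below. The function names fix_surrogate_escapes and convert_file and the file names are hypothetical. The sketch assumes the file is plain ASCII with well-formed surrogate pairs that do not span lines; an unpaired surrogate would make the UTF-16 decode step raise an error.

```python
import os
import tempfile


def fix_surrogate_escapes(line: bytes) -> bytes:
    """Rewrite \\ud83d\\ude04-style escape pairs as \\U0001f604-style escapes."""
    return (line
        .decode("raw_unicode_escape")
        .encode("utf-16", "surrogatepass")
        .decode("utf-16")
        .encode("raw_unicode_escape"))


def convert_file(src_path: str, dst_path: str) -> None:
    """Convert a file line by line; binary mode keeps the bytes untouched otherwise."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in src:
            dst.write(fix_surrogate_escapes(line))


# Small demonstration using temporary files.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.txt")
    dst = os.path.join(tmp, "out.txt")
    with open(src, "wb") as f:
        f.write(rb"hello \ud83d\ude04" + b"\n")
    convert_file(src, dst)
    with open(dst, "rb") as f:
        result = f.read()

print(result)  # b'hello \\U0001f604\n'
```

Dropping the final .encode("raw_unicode_escape") and writing the decoded string as UTF-8 instead would store the actual emoji character rather than the escape sequence.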
