Decode emoji into two (or more) code points, using standard libraries

Question

I would like to be able to decode an emoji into its corresponding code points, as shown here. I am limited to the standard library in Python 2.7.

For example: 🇲🇩 -> U+1F1F2 U+1F1E9

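I have managed to get the first code point with the code below, but I don't know how to extract the second one. Some emoji have even more code points.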

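to_decode = u'\U0001F1F2\U0001F1E9'
code = ord(to_decode[0])
# On a narrow build the first unit is a high surrogate; combine it with the low surrogate
if 0xd800 <= code <= 0xdbff:
    code = (code - 0xd800) * 1024 + (ord(to_decode[1]) - 0xdc00) + 0x010000

print(hex(code))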
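
One way the snippet above could be extended to every code point is to walk the string and join each high/low surrogate pair as it appears. This is only a sketch: the function name `code_points` is made up for illustration, and the pairing logic matters only on narrow Python 2 builds (on wide builds and Python 3 each element is already a full code point).

```python
def code_points(text):
    """Return the code points of text as a list of ints, joining surrogate pairs."""
    result = []
    i = 0
    while i < len(text):
        code = ord(text[i])
        # A high surrogate followed by a low surrogate encodes one code point.
        if 0xD800 <= code <= 0xDBFF and i + 1 < len(text):
            low = ord(text[i + 1])
            if 0xDC00 <= low <= 0xDFFF:
                code = (code - 0xD800) * 1024 + (low - 0xDC00) + 0x10000
                i += 1  # skip the low surrogate that was just consumed
        result.append(code)
        i += 1
    return result


print(' '.join('U+%04X' % cp for cp in code_points(u'\U0001F1F2\U0001F1E9')))
```

An unpaired surrogate is passed through unchanged rather than raising an error, which is a design choice of this sketch and not something the question specifies.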

Answer 1

This is a hack, but you can use the repr of the unicode string:

>>> repr(to_decode)
"u'\U0001f1f2\U0001f1e9'"

So:

>>> hex(int(repr(to_decode)[4:12], 16))
'0x1f1f2'

and

>>> hex(int(repr(to_decode)[14:22], 16))
'0x1f1e9'

You will have to extend this method to support emoji with more than two code points. You might consider combining the above with .split("\U").
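
The split idea might look roughly like the sketch below. It swaps repr for the `unicode_escape` codec, which should produce the same `\UXXXXXXXX` escapes without the `u'...'` wrapper and also runs on Python 3. The name `astral_code_points` is hypothetical, and the sketch assumes the input contains no literal backslashes. Like the slicing above, it only picks up `\U` escapes, so code points at or below U+FFFF (a zero-width joiner, for instance) are skipped.

```python
def astral_code_points(text):
    """Return the code points above U+FFFF found in text, as ints."""
    # unicode_escape renders such characters as \UXXXXXXXX (8 hex digits).
    escaped = text.encode('unicode_escape').decode('ascii')
    # Everything before the first \U is not an astral escape, so drop it.
    parts = escaped.split('\\U')[1:]
    # Only the first 8 characters of each part belong to the escape.
    return [int(part[:8], 16) for part in parts]


print([hex(cp) for cp in astral_code_points(u'\U0001F1F2\U0001F1E9')])
```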

Answer 2

For this you actually want list(), which breaks a Unicode string down into its constituent code points:

to_decode = u'\U0001F1F2\U0001F1E9'
list(to_decode)
['🇲', '🇩']

As an example of what you can do with this, I created a unicode visualization of the Bengali alphabet:

https://www.kaggle.com/jamesmcguigan/unicode-visualization-of-the-bengali-alphabet
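
To get from the list() result to the U+XXXX notation asked for in the question, each element can be passed through ord(). A minimal sketch, with the hypothetical name `split_code_points`; it assumes Python 3 or a wide Python 2 build, because on a narrow Python 2.7 build list() yields the individual surrogate halves instead of whole code points.

```python
def split_code_points(text):
    """Return one 'U+XXXX' label per element of list(text)."""
    # list() splits the string into its elements; ord() gives each one's number.
    return ['U+%04X' % ord(ch) for ch in list(text)]


print(split_code_points(u'\U0001F1F2\U0001F1E9'))
```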

Answer 3

A combination of encode and struct.unpack can do what you need.

>>> import struct
>>> b = to_decode.encode('utf_32_le')
>>> count = len(b) // 4
>>> count
2
>>> cp = struct.unpack('<%dI' % count, b)
>>> [hex(x) for x in cp]
['0x1f1f2', '0x1f1e9']
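
The same steps can be wrapped in a small helper that returns the notation used in the question. This is a sketch; the name `to_u_notation` is an assumption and not part of the answer. It relies on UTF-32 storing every code point in exactly four bytes, so the count is simply the byte length divided by four.

```python
import struct


def to_u_notation(text):
    """Format every code point of text as U+XXXX, separated by spaces."""
    raw = text.encode('utf_32_le')  # 4 bytes per code point, no BOM
    count = len(raw) // 4
    # '<' = little-endian, 'I' = unsigned 32-bit integer, repeated count times
    code_points = struct.unpack('<%dI' % count, raw)
    return ' '.join('U+%04X' % cp for cp in code_points)


print(to_u_notation(u'\U0001F1F2\U0001F1E9'))  # U+1F1F2 U+1F1E9
```

Because the encoder works on whole code points, this also covers emoji built from three or more code points, including joiners and variation selectors inside the Basic Multilingual Plane.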

python

unicode

python-2.7

emoji