Convert unicode string representation of emoji to unicode emoji in python

Question

I am using Python 2 on Spark (PySpark and Pandas) to analyze data about emoji usage. I have strings like u'u+1f375' or u'u+1f618' that I want to convert to 🍵 and 😘 respectively.

I have read several other SO posts and the Unicode HOWTO trying to get a grasp on encode and decode, but to no avail.

This did not work:

decode_udf = udf(lambda x: x.decode('unicode-escape'))
foo = emojis.withColumn('decoded_emoji', decode_udf(emojis.emoji))
Result: decoded_emoji=u'u+1f618'

This ended up working as a one-off on a single string, but it fails when I apply it to my RDD.

def rename_if_emoji(pattern):
  """rename the element name of dataframe with emoji"""

  if pattern.lower().startswith("u+"):
    emoji_string = ""
    EMOJI_PREFIX = "u+"
    for part_org in pattern.lower().split(" "):
      part = part_org.strip();
      if (part.startswith(EMOJI_PREFIX)):
        padding = "0" * (8 + len(EMOJI_PREFIX) - len(part)) 
        codepoint = '\U' + padding + part[len(EMOJI_PREFIX):]
        print("codepoint: " + codepoint)
        emoji_string += codepoint.decode('unicode-escape')
        print("emoji_string: " + emoji_string)
    return emoji_string
  else:
    return pattern

rename_if_emoji_udf = udf(rename_if_emoji)

Error: UnicodeEncodeError: 'ascii' codec can't encode character u'\U0001f618' in position 14: ordinal not in range(128)

Answer 1

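Whether emoji print correctly depends on the IDE/terminal in use. Because Python 2's print encodes Unicode strings to the terminal's encoding, you will get a UnicodeEncodeError on a terminal that does not support the characters. You also need font support. Your error is in the print calls. You have decoded the string correctly, but your output device should ideally support UTF-8.

This example simplifies the decoding process. I print the repr() of the string in case the terminal is not configured to support the characters being printed.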
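The failure described here can be reproduced without a terminal by encoding the string explicitly. The following is a minimal sketch, assuming an ASCII codec stands in for an unsupported output device; the helper name `can_encode` is hypothetical and not part of the original answer.

```python
# Sketch: emulate the encoding step that print performs before writing
# a Unicode string to an output device.
# `can_encode` is a hypothetical helper introduced for illustration.

def can_encode(text, encoding):
    """Return True if `text` can be encoded with `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

emoji = u'\U0001f618'

# An ASCII-only device cannot represent the emoji, so encoding fails
# with the same kind of UnicodeEncodeError shown in the question.
ascii_ok = can_encode(emoji, 'ascii')

# UTF-8 can represent every code point, so this succeeds.
utf8_ok = can_encode(emoji, 'utf-8')

# An escape encoding always yields plain ASCII bytes, which is why
# printing an escaped form is safe on any device.
safe = emoji.encode('unicode-escape')
```

If `ascii_ok` is False while `utf8_ok` is True, the decoding itself is fine and only the output encoding is at fault.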

import re

def replacement(m):
    '''Assume the matched characters are hexadecimal, convert to integer,
       format appropriately, and decode back to Unicode.
    '''
    i = int(m.group(1),16)
    return '\U{:08X}'.format(i).decode('unicode-escape')

def replace(s):
    '''Replace all u+nnnn strings with the Unicode equivalent.
    '''
    return re.sub(ur'u\+([0-9a-fA-F]+)',replacement,s)

s = u'u+1f618 u+1f375'
t = replace(s)
print repr(t)
print t

Output (on a UTF-8 IDE):

u'\U0001f618 \U0001f375'
😘 🍵
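For comparison, on Python 3 the intermediate `\U` escape string is not needed, because `chr` accepts any code point up to 0x10FFFF. The following is a sketch under that assumption, not the answer's own method; the names `CODEPOINT_RE` and `replace_codepoints` are hypothetical. On Python 2, `unichr` would only handle these code points on a wide build, which is why the answer above goes through 'unicode-escape'.

```python
import re

# Matches "u+" followed by 1 to 6 hex digits, case-insensitively.
CODEPOINT_RE = re.compile(r'u\+([0-9a-f]{1,6})', re.IGNORECASE)

def replace_codepoints(s):
    """Replace every u+nnnn token in `s` with the character it names."""
    def _sub(m):
        value = int(m.group(1), 16)
        # Leave values beyond the Unicode range untouched instead of raising.
        if value > 0x10FFFF:
            return m.group(0)
        return chr(value)
    return CODEPOINT_RE.sub(_sub, s)

t = replace_codepoints('u+1f618 u+1f375')
# ascii() escapes non-ASCII characters, so this is safe on any terminal.
print(ascii(t))
```

The function contains no print calls, so it should be usable as a UDF in the same way as in the question, for example `udf(replace_codepoints)`, assuming the Spark workers also run Python 3.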

