Remove emoji from multilingual Unicode text

Question

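I'm trying to remove just the emoji from a piece of Unicode text. I have tried various methods, but none of them removes all emoji/smileys completely. For example: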

Solution 1:

def remove_emoji(self, string):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', string)

This leaves an emoji behind in the following example:

Input: తెలంగాణ రియల్ ఎస్టేట్ 
Output: తెలంగాణ రియల్ ఎస్టేట్

Another attempt, Solution 2:

def deEmojify(self, inputString):
    returnString = ""
    for character in inputString:
        try:
            character.encode("ascii")
            returnString += character
        except UnicodeEncodeError:
            returnString += ''
    return returnString

This results in every non-English character being removed:

 Input: Testరియల్ ఎస్టేట్ A.P&T.S. 
 Output: Test  A.P&T.S.

It removed not only all the emoji but also the non-English characters, because of character.encode("ascii"); my non-English input cannot be encoded as ASCII.

Is there a way to properly remove emoji from international Unicode text?

Answer 1

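Your regex is outdated. It appears to cover emoji defined up to Unicode 8.0 (U+1F91D HANDSHAKE was added in Unicode 9.0). The other approach is just a very inefficient way of force-encoding to ASCII, which is rarely what you want when you only need to remove emoji (and it can be done more simply and efficiently with text.encode('ascii', 'ignore').decode('ascii')).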
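To make the parenthetical concrete, here is a minimal sketch of that one-line ASCII round trip. The helper name strip_non_ascii is made up for illustration. It also shows why the approach is too blunt: every non-ASCII character is dropped, not just emoji.

```python
def strip_non_ascii(text):
    # Encode to ASCII, silently dropping anything that cannot be encoded,
    # then decode back to text. Hypothetical helper, for illustration only.
    return text.encode('ascii', 'ignore').decode('ascii')


# Two Telugu code points and U+1F91D HANDSHAKE are all discarded together.
sample = u'Test\u0C30\u0C3F A.P&T.S. \U0001F91D'
print(repr(strip_non_ascii(sample)))
```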

If you need a more up-to-date regex, take one from a package that is actively trying to keep up to date on emoji; it specifically supports generating such a regex:

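import emoji

def remove_emoji(text):
    # get_emoji_regexp() was removed in emoji 2.0; on newer releases,
    # emoji.replace_emoji(text, replace='') does the same job.
    return emoji.get_emoji_regexp().sub(u'', text)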

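The package is currently up to date for Unicode 11.0 and has the infrastructure in place to be updated quickly for future releases. All your project has to do is upgrade when a new version comes out.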

Demo using your sample inputs:

>>> print(remove_emoji(u'తెలంగాణ రియల్ ఎస్టేట్ '))
తెలంగాణ రియల్ ఎస్టేట్ 
>>> print(remove_emoji(u'Testరియల్ ఎస్టేట్ A.P&T.S. '))
Testరియల్ ఎస్టేట్ A.P&T.S.

Note that the regex works on Unicode text: on Python 2, make sure you have decoded from str to unicode first; on Python 3, from bytes to str.
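As a small illustration of that point on Python 3, the sketch below decodes raw bytes before any pattern would be applied. It assumes the data is UTF-8 encoded, which may not hold for your input.

```python
# Bytes as they might arrive from a file or socket (built here for the demo).
raw = u'Test \U0001F91D'.encode('utf-8')

# Decode to str first. Assumption: the data is UTF-8 encoded.
text = raw.decode('utf-8')

# The emoji is a single code point in the str but four bytes in the raw data.
print(len(raw), len(text))
```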

Emoji are complex beasts these days. The above will remove complete, valid emoji. If you have 'incomplete' emoji components, such as skin-tone codepoints (meant to be combined with specific emoji only), then you'll have much more trouble removing those. The skin-tone codepoints are easy (just remove those 5 codepoints afterwards), but there is a whole host of combinations made up of innocent characters such as ♀ U+2640 FEMALE SIGN or ♂ U+2642 MALE SIGN together with variant selectors and the U+200D ZERO-WIDTH JOINER. These have specific meaning in other contexts too, and you can't just regex them out unless you don't mind breaking text that uses Devanagari, Kannada or CJK ideographs, to name just a few.
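A minimal sketch of the easy case, assuming Python 3 (where str is always full Unicode): the five skin-tone modifiers occupy U+1F3FB..U+1F3FF and can be stripped with a single character class. The names skin_tones and remove_skin_tones are illustrative. The second half only shows a U+200D sitting inside ordinary Devanagari text, which is why that codepoint should not be stripped blindly.

```python
import re

# The five skin-tone modifiers, U+1F3FB..U+1F3FF (Python 3 str assumed).
skin_tones = re.compile('[\U0001F3FB-\U0001F3FF]')

def remove_skin_tones(text):
    return skin_tones.sub('', text)

# THUMBS UP SIGN + medium skin tone becomes a plain THUMBS UP SIGN.
thumbs = remove_skin_tones('\U0001F44D\U0001F3FD ok')

# U+200D is different: here it sits inside a Devanagari cluster
# (KA, VIRAMA, ZWJ, SSA) and influences how that cluster is rendered,
# so stripping it would alter legitimate text.
conjunct = '\u0915\u094D\u200D\u0937'
print(thumbs == '\U0001F44D ok', '\u200D' in conjunct)
```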

That said, the following Unicode 11.0 codepoints can probably be removed safely (based on filtering the Emoji_Component emoji-data designation):

20E3          ; combining enclosing keycap
FE0F          ; VARIATION SELECTOR-16
1F1E6..1F1FF  ; regional indicator symbol letter a..regional indicator symbol letter z
1F3FB..1F3FF  ; light skin tone..dark skin tone
1F9B0..1F9B3  ; red-haired..white-haired
E0020..E007F  ; tag space..cancel tag

These can be removed by creating a new regex to match them:

import re
try:
    uchr = unichr  # Python 2
    import sys
    if sys.maxunicode == 0xffff:
        # narrow build, define alternative unichr encoding to surrogate pairs
        # as unichr(sys.maxunicode + 1) fails.
        def uchr(codepoint):
            return (
                unichr(codepoint) if codepoint <= sys.maxunicode else
                unichr(codepoint - 0x010000 >> 10 | 0xD800) +
                unichr(codepoint & 0x3FF | 0xDC00)
            )
except NameError:
    uchr = chr  # Python 3

# Unicode 11.0 Emoji Component map (deemed safe to remove)
_removable_emoji_components = (
    (0x20E3, 0xFE0F),             # combining enclosing keycap, VARIATION SELECTOR-16
    range(0x1F1E6, 0x1F1FF + 1),  # regional indicator symbol letter a..regional indicator symbol letter z
    range(0x1F3FB, 0x1F3FF + 1),  # light skin tone..dark skin tone
    range(0x1F9B0, 0x1F9B3 + 1),  # red-haired..white-haired
    range(0xE0020, 0xE007F + 1),  # tag space..cancel tag
)
emoji_components = re.compile(u'({})'.format(u'|'.join([
    re.escape(uchr(c)) for r in _removable_emoji_components for c in r])),
    flags=re.UNICODE)

Then update the remove_emoji() function above to use it:

def remove_emoji(text, remove_components=False):
    cleaned = emoji.get_emoji_regexp().sub(u'', text)
    if remove_components:
        cleaned = emoji_components.sub(u'', cleaned)
    return cleaned
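If only Python 3 needs to be supported, the same set of component codepoints can be sketched more compactly as one character class. This is an alternative formulation rather than the code above, and strip_components is a name chosen for illustration.

```python
import re

# The Unicode 11.0 component code points listed earlier, as one character class.
emoji_components = re.compile(
    '['
    '\u20E3'                  # combining enclosing keycap
    '\uFE0F'                  # VARIATION SELECTOR-16
    '\U0001F1E6-\U0001F1FF'   # regional indicator symbols
    '\U0001F3FB-\U0001F3FF'   # skin-tone modifiers
    '\U0001F9B0-\U0001F9B3'   # hair components
    '\U000E0020-\U000E007F'   # tag characters
    ']'
)

def strip_components(text):
    return emoji_components.sub('', text)

# A keycap sequence (digit + VS16 + enclosing keycap) is reduced to the digit.
print(strip_components('1\uFE0F\u20E3 item'))   # prints "1 item"
```

A character class avoids building a long alternation of single codepoints, at the cost of dropping the Python 2 narrow-build handling.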

Answer 2

If you use the regex library instead of the re library, you get access to Unicode properties, and you can then change your function to:

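import regex  # third-party package, not the stdlib re

def remove_emoji(self, string):
    # Match anything that is not a letter, digit, separator or mark; && needs regex.V1.
    emoji_pattern = regex.compile(r"[\P{L}&&\P{Nd}&&\P{Z}&&\P{M}]", flags=regex.V1)
    return emoji_pattern.sub(r'', string)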

This keeps all letters, digits, separators and marks (accents).
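The same idea can be sketched with only the standard library by checking each character's Unicode general category through unicodedata.category. This is a stand-in for the property-based pattern, not a drop-in replacement, and keep_text_chars is a hypothetical name. Note that this approach also drops punctuation and symbols, not just emoji.

```python
import unicodedata

def keep_text_chars(text):
    # Keep Letters (L*), decimal digits (Nd), separators (Z*) and Marks (M*);
    # drop everything else, which includes emoji but also punctuation.
    kept = []
    for ch in text:
        cat = unicodedata.category(ch)
        if cat[0] in 'LZM' or cat == 'Nd':
            kept.append(ch)
    return ''.join(kept)

print(repr(keep_text_chars('Test\u0C30\u0C3F 123 \U0001F91D')))
print(keep_text_chars('A.P&T.S.'))   # prints "APTS"
```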
