Split and count emojis and words in a given string in Python

Question

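For a given string, I'm trying to count the number of occurrences of each word and each emoji. I have already done this for emojis that consist of a single code point. The problem is that many current emojis are composed of several emojis.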

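For example, a family emoji is made up of four person emojis joined by zero-width joiners (U+200D), and there are emojis with a human skin tone, which are a base emoji followed by a skin-tone modifier, and so on.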

The problem boils down to how to split the string into the right pieces; once that is done, counting them is easy.

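There are some good questions that deal with the same thing, such as link1 and one other, but none of them gives a general solution (or the solutions are outdated, or I just couldn't figure them out).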

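For example, if the string is hello <emoji1> emoji hello <emoji2>, where <emoji1> and <emoji2> are each a multi-code-point emoji, then I want to get {'hello': 2, 'emoji': 1, '<emoji2>': 1, '<emoji1>': 1}. My strings come from WhatsApp, and all of them are encoded in UTF-8.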

I have had many failed attempts. Any help would be appreciated.

Answer 1

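Use the third-party regex module, which supports matching grapheme clusters (sequences of Unicode code points that are rendered as a single character) with \X: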

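>>> import regex
>>> # Sample input: a family ZWJ sequence followed by an emoji with a skin-tone modifier
>>> s = '\U0001F468\u200D\U0001F469\u200D\U0001F466\u200D\U0001F466\U0001F645\U0001F3FD'
>>> regex.findall(r'\X', s)
['👨\u200d👩\u200d👦\u200d👦', '🙅🏽']
>>> for c in regex.findall(r'\X', s):
...     print(c)
...
👨‍👩‍👦‍👦
🙅🏽

To count them:

>>> data = regex.findall(r'\X', s)
>>> from collections import Counter
>>> Counter(data)
Counter({'👨\u200d👩\u200d👦\u200d👦': 1, '🙅🏽': 1})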
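To see why iterating over the string one character at a time splits such an emoji apart, it can help to list its code points. The following is only an illustrative sketch using the standard library; the helper name describe and the sample family sequence are assumptions made for the example, not part of the answer above.

```python
import unicodedata

# Sample family emoji: four person emojis joined by three zero-width joiners
family = '\U0001F468\u200D\U0001F469\u200D\U0001F466\u200D\U0001F466'


def describe(text):
    """Return a (code point, Unicode name) pair for every code point in text."""
    return [(f'U+{ord(ch):04X}', unicodedata.name(ch, '<unnamed>'))
            for ch in text]


# One visible character, but seven code points as far as len() is concerned
print(len(family))
for code, name in describe(family):
    print(code, name)
```

A plain for loop over the string or a Counter(s) would therefore count the four people and the joiners separately, which is the problem that \X avoids.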

Answer 2

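Many thanks to Mark Tolonen. Now, to count both the words and the emojis in a given string, I use emoji.UNICODE_EMOJI (from the emoji package) to decide what is an emoji and what is not, and then remove the emojis from the string.

It is not an ideal answer at the moment, but it works. I will edit it if that changes.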

import emoji
import regex
from collections import Counter

def split_count(text):
    total_emoji = []
    # \X matches one grapheme cluster, so a multi-code-point emoji stays in one piece
    data = regex.findall(r'\X', text)
    for word in data:
        # Treat a cluster as an emoji if any of its code points is a known emoji
        if any(char in emoji.UNICODE_EMOJI for char in word):
            total_emoji.append(word)  # total_emoji is a list of all emojis

    # Remove the emojis from the given text
    for current in total_emoji:
        text = text.replace(current, '')

    return Counter(text.split() + total_emoji)


text_string = "😁 😁 😁 here 😁 hello 😊 world hello😁👨‍👩‍👦‍👦"
final_counter = split_count(text_string)

Output:

final_counter
Counter({'hello': 2,
         'here': 1,
         'world': 1,
         '👨\u200d👩\u200d👦\u200d👦': 1,
         '😁': 5,
         '😊': 1})
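If installing the regex and emoji packages is not an option, a rough approximation is possible with the standard library alone. The sketch below is not equivalent to \X: the code point ranges are hand-picked assumptions that cover common pictographic emojis, skin-tone modifiers, variation selector-16 and ZWJ sequences, and it will miss things such as flags and keycap sequences. The name approx_split_count is hypothetical.

```python
import re
from collections import Counter

# Hand-picked ranges (an assumption, not the full Unicode emoji list)
_EMOJI_CHARS = "\U0001F300-\U0001FAFF\u2600-\u27BF"
_MODIFIERS = "\U0001F3FB-\U0001F3FF\uFE0F"  # skin tones, variation selector-16
_ZWJ = "\u200D"

# One emoji piece: a base character plus any trailing modifiers
_PIECE = f"[{_EMOJI_CHARS}][{_MODIFIERS}]*"
# A cluster: pieces glued together by zero-width joiners
_CLUSTER = f"{_PIECE}(?:{_ZWJ}{_PIECE})*"
# A word: a run of anything that is neither whitespace nor emoji-related
_WORD = f"[^\\s{_EMOJI_CHARS}{_ZWJ}\uFE0F]+"
_TOKEN = re.compile(f"{_CLUSTER}|{_WORD}")


def approx_split_count(text):
    """Count words and (approximate) emoji clusters in a single pass."""
    return Counter(_TOKEN.findall(text))


sample = "hello \U0001F469\u200D\U0001F393 emoji hello \U0001F645\U0001F3FD"
print(approx_split_count(sample))
```

Because the tokens are collected in one pass, there is no need to remove the emojis from the text with str.replace afterwards, and a word directly followed by an emoji is still split correctly.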

Answer 3

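In newer releases of the emoji package, emoji.UNICODE_EMOJI is a dictionary keyed by language, with the following structure:

{'en': 
    {'🥇': ':1st_place_medal:',
     '🥈': ':2nd_place_medal:',
     '🥉': ':3rd_place_medal:'
... }
}

So you need to use emoji.UNICODE_EMOJI['en'] for the code above to work.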
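If the same script has to run against both layouts (the flat mapping that the earlier answer relies on and the language-keyed one shown here), one option is to normalize the table before using it. This is only a sketch: emoji_table is a hypothetical helper, and it takes the dictionary as an argument so that it does not depend on which release of the emoji package is installed.

```python
def emoji_table(unicode_emoji):
    """Return a flat {emoji: name} mapping from either dictionary layout.

    Assumes the language-keyed layout stores English names under 'en'.
    """
    english = unicode_emoji.get('en')
    if isinstance(english, dict):
        return english        # language-keyed layout: {'en': {...}, ...}
    return unicode_emoji      # flat layout: {emoji: name, ...}


# Usage with hand-made tables standing in for emoji.UNICODE_EMOJI
flat = {'\U0001F947': ':1st_place_medal:'}
keyed = {'en': flat}
print('\U0001F947' in emoji_table(flat), '\U0001F947' in emoji_table(keyed))
```

With this in place, the membership test in split_count could be written as char in emoji_table(emoji.UNICODE_EMOJI), assuming the installed release still exposes UNICODE_EMOJI.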
