How to encode emojis that are in text with Python/pandas (for counting them/finding most frequently occurring, etc)?

Question

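I'm working in Python and pandas, and I have a dataframe where one column contains phrases with emojis in them, such as "when life gives you 🍋s, make lemonade" or "Catch a falling ⭐️ and put it in your pocket". Not all of the phrases have emojis, and when they do, the emoji can be anywhere in the phrase (not just at the beginning or end). I want to go through each text and count how often each emoji appears, which emojis appear most often, and so on. I'm not sure how to actually process/recognize the emojis. If I go through each text in the column, how would I identify the emojis so that I can gather the information I want, such as counts, the maximum, etc.?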

Answer 1

Suppose you have a dataframe like this:

import re  # needed for re.findall below
import pandas as pd
from collections import defaultdict

df = pd.DataFrame({'phrases' : ["Smiley emoticon rocks! I like you.\U0001f601", 
                                "Catch a falling ⭐️ and put it in your pocket"]})

which produces

                 phrases
0   Smiley emoticon rocks! I like you.😁
1   Catch a falling ⭐️ and put it in your pocket

You can do something like this:

# Dictionary storing emoji counts 
emoji_count = defaultdict(int)
for i in df['phrases']:
    for emoji in re.findall(u'[\U0001f300-\U0001f650]|[\u2000-\u3000]', i):
        emoji_count[emoji] += 1

print(emoji_count)

Note that I changed the ranges in re.findall(u'[\U0001f300-\U0001f650]|[\u2000-\u3000]', i).

The alternation (the part after the |) is there to handle a different Unicode group, but you should get the idea.
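Since the question also asks for the most frequent emoji, here is a minimal sketch of the same idea using collections.Counter instead of defaultdict. The names EMOJI_PATTERN and count_emojis are made up for illustration, and the two ranges are just the ones from the regex above, so they remain an approximation: U+2000 to U+3000 also contains typographic punctuation such as curly quotes, which would be counted too.

```python
import re
from collections import Counter

# Same two character ranges as the regex in this answer; they are an
# approximation and do not cover every emoji.
EMOJI_PATTERN = re.compile('[\U0001f300-\U0001f650]|[\u2000-\u3000]')

def count_emojis(phrases):
    """Count regex matches across an iterable of strings."""
    counts = Counter()
    for phrase in phrases:
        counts.update(EMOJI_PATTERN.findall(phrase))
    return counts

# Plain list standing in for the dataframe column.
phrases = ["Smiley emoticon rocks! I like you.\U0001f601",
           "Catch a falling \u2b50\ufe0f and put it in your pocket"]

counts = count_emojis(phrases)
top = counts.most_common(1)  # list of (emoji, count) pairs, most frequent first
print(counts)
print(top)
```

Iterating over a pandas Series yields its values, so count_emojis(df['phrases']) should work the same way as long as the column holds only strings.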

In Python 2.x, you can convert the emoji to unicode using

unicode('⭐️', 'utf-8') # u'\u2b50\ufe0f' - output
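In Python 3, str is already Unicode, so no conversion step is needed. As a rough sketch (the variable names are only for illustration), you can inspect the code points behind an emoji like this, which also shows that the star is really two code points: the star itself plus a variation selector.

```python
import unicodedata

text = "\u2b50\ufe0f"  # the star emoji followed by a variation selector

# Python 3 strings are sequences of Unicode code points.
code_points = [f"U+{ord(ch):04X}" for ch in text]
names = [unicodedata.name(ch, "<unnamed>") for ch in text]

print(code_points)
print(names)
```

The trailing U+FE0F falls outside both ranges of the regex used here, so only the star itself gets counted.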

Output of the defaultdict loop:

defaultdict(int, {'⭐': 1, '': 1, '': 1})

That regex was shamelessly stolen from this link.

python

emoji

pandas