Count word frequencies of each word in a list in a dataframe

Question

I created lists of words associated with specific categories. For example:

care = ["safe", "peace", "empathy"]

I have a dataframe of speeches, each about 450 words long on average. I counted the number of matches per category with this line of code:

df['Care'] = df['Speech'].apply(lambda x: len([val for val in x.split() if val in care]))

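This gives the total number of matches for each category.

However, I need to see the frequency of each individual word in the list. I tried the following code to solve this: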

df.Tal.str.extractall('({})'.format('|'.join(auktoritet)))\
                           .iloc[:, 0].str.get_dummies().sum(level=0)

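I tried different approaches, but the problem is that I always get partial matches. For example, "hammer" is counted as "ham".

Any ideas on how to solve this?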

Answer 1

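You can convert each word into a tuple with 1 as the second element, ('word', 1), and then sum those values for each word in the list.

The output will be a list of tuples containing each word and its frequency: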

[('word1', 3), ('word2', 10) ... ]

That is the main idea.
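The idea above can be sketched roughly as follows. This is only a minimal illustration, not necessarily what the answerer had in mind: the sample sentence is made up, and the `care` list is taken from the question.

```python
from collections import defaultdict

care = ["safe", "peace", "empathy"]
# Hypothetical sample text standing in for one speech
speech = "peace and safe streets bring peace"

# Map step: turn every word of interest into a ('word', 1) pair
pairs = [(word, 1) for word in speech.split() if word in care]

# Reduce step: sum the second element of the pairs, grouped by word
totals = defaultdict(int)
for word, one in pairs:
    totals[word] += one

result = sorted(totals.items())
print(result)  # [('peace', 2), ('safe', 1)]
```

Because the words are compared as whole tokens after `split()`, a longer word such as "safety" would not be counted as "safe".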

Answer 2

You can use Counter from the collections module:

from collections import Counter

word_count = Counter()
for line in df['Speech']:
    # split() with no argument splits on any run of whitespace
    for word in line.split():
        word_count[word] += 1

This stores the count of every word in word_count. You can then use

word_count.most_common()

to see the words with the highest frequencies.
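One caveat with splitting on whitespace alone is that "Peace," and "peace" end up as different keys. A possible refinement, shown here only as a sketch, is to lowercase the text and extract tokens with a regular expression. The plain list `speeches` is a made-up stand-in for the `df['Speech']` column, and the token pattern is an assumption that may need adjusting for other languages.

```python
import re
from collections import Counter

# Hypothetical sample data standing in for df['Speech']
speeches = ["Peace, peace and safety.", "Safe and sound; SAFE at last!"]

word_count = Counter()
for line in speeches:
    # Lowercase, then keep runs of letters, digits and apostrophes as tokens
    tokens = re.findall(r"[a-z0-9']+", line.lower())
    word_count.update(tokens)

print(word_count["peace"])  # 2
print(word_count["safe"])   # 2, "safety" is a separate key
```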

Answer 3

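Building on Akash's answer, I managed to get the frequencies of the pre-specified words stored in a list, counted across the dataframe, by simply adding one line.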

from collections import Counter

word_count = Counter()
for line in df['Speech']:
    for word in line.split():
        # The added line: only count words that appear in the category list
        if word in care:
            word_count[word] += 1

word_count.most_common()
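If the counts are needed per speech rather than over the whole column, and the regex approach from the question is preferred, one option is to anchor the pattern with `\b` word boundaries so that only whole words match. The following is a sketch under assumptions: `speeches` is a made-up stand-in for `df['Speech']`, and `rows` is a hypothetical name for the per-speech results.

```python
import re
from collections import Counter

care = ["safe", "peace", "empathy"]
# Hypothetical speeches; in the question these would come from df['Speech']
speeches = ["Peace and safety keep us safe.", "Empathy brings peace, peace of mind."]

# \b anchors each alternative to word boundaries,
# so 'safe' does not match inside 'safety'
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, care)) + r")\b", re.IGNORECASE)

rows = []
for text in speeches:
    hits = Counter(match.lower() for match in pattern.findall(text))
    # One dict per speech, with an explicit 0 for words that never occur
    rows.append({word: hits[word] for word in care})

print(rows)
# [{'safe': 1, 'peace': 1, 'empathy': 0}, {'safe': 0, 'peace': 2, 'empathy': 1}]
```

With pandas available, something like `pd.DataFrame(rows, index=df.index)` should turn these dictionaries into one column per word, which could then be joined back onto the original dataframe.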

python

string-matching

word-frequency