Find most common words from set of sentences in Python

Question

I have 5 sentences in a np.array and I want to find the n most common words across them. For example, if n is 3, I want the 3 most common words. Here is an example:

0    oh i am she cool though might off her a brownie lol
1    so trash wouldnt do colors better tweet
2    love monkey brownie as much as a tweet
3    monkey get this tweet around i think
4    saw a brownie to make me some monkey

If n is 3, I would want it to print the words brownie, monkey, and tweet. Is there a straightforward way to do something like this?

Answer 1

You can do this with the help of CountVectorizer, as shown below:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

A = np.array(["oh i am she cool though might off her a brownie lol", 
              "so trash wouldnt do colors better tweet", 
              "love monkey brownie as much as a tweet",
              "monkey get this tweet around i think",
              "saw a brownie to make me some monkey" ])

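n = 3
# Build the count matrix: one row per sentence, one column per word.
# Note: the default tokenizer ignores single-character tokens such as "i" and "a".
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(A)

# get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() is available from 1.0 onwards.
vocabulary = vectorizer.get_feature_names_out()
# Sum the counts over all sentences and keep the indices of the n largest totals.
ind = np.argsort(X.toarray().sum(axis=0))[-n:]

top_n_words = [str(vocabulary[a]) for a in ind]

print(top_n_words)
# e.g. ['tweet', 'monkey', 'brownie'] (the order of words with equal counts may vary)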
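
If you would rather not depend on scikit-learn, a rough alternative is to count the words with collections.Counter from the standard library. This is a different technique from CountVectorizer, not an equivalent of it: it splits on whitespace only. The helper name most_common_words and its min_len parameter are made up for this sketch, and the length filter is an assumption added to mimic CountVectorizer's default of ignoring single-character tokens (without it, "a" would tie with the other three words).

```python
from collections import Counter

import numpy as np


def most_common_words(sentences, n, min_len=2):
    """Return the n most frequent whitespace-separated words.

    Hypothetical helper for illustration. Words shorter than min_len
    characters are skipped (assumption: mimic CountVectorizer's default).
    """
    counts = Counter(
        word
        for sentence in sentences
        for word in sentence.split()
        if len(word) >= min_len
    )
    # most_common returns (word, count) pairs sorted by descending count.
    return [word for word, _ in counts.most_common(n)]


A = np.array(["oh i am she cool though might off her a brownie lol",
              "so trash wouldnt do colors better tweet",
              "love monkey brownie as much as a tweet",
              "monkey get this tweet around i think",
              "saw a brownie to make me some monkey"])

top_n_words = most_common_words(A, 3)
print(top_n_words)
```

For the sample data this should give brownie, monkey, and tweet, each of which appears 3 times.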

Hope this helps!
