How to get the most representative features in the following tfidf model?

Question

Hello, I have the following list:

listComments = ["comment1","comment2","comment3",...,"commentN"]

I created a tfidf vectorizer to build a model from my comments, as follows:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(min_df=10,ngram_range=(1,3),analyzer='word')
tfidf = tfidf_vectorizer.fit_transform(listComments)

Now, to understand my model better, I would like to get its most representative features. I tried:

print("these are the features :",tfidf_vectorizer.get_feature_names())
print("the vocabulary :",tfidf_vectorizer.vocabulary_)

This gave me a list of words that I believe my model uses for vectorization:

these are the features : ['10', '10 days', 'red', 'car',...]

the vocabulary : {'edge': 86, 'local': 96, 'machine': 2,...}

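However, I would like a way to get the 30 most representative features, meaning the words that reach the highest values in my tfidf model, the words with the highest inverse frequency. I have been reading the documentation but could not find a method for this. I would really appreciate any help with this. Thanks in advance.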

Answer 1

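If you want the vocabulary ordered by idf score, you can take the idf_ attribute and argsort it.

import numpy as np

# Array of feature names, indexed by column of the tf-idf matrix.
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2.
feature_names = np.asarray(tfidf_vectorizer.get_feature_names_out())

# Indices of the features sorted by idf, highest first
idf_order = tfidf_vectorizer.idf_.argsort()[::-1]

# Feature names sorted by idf; slice the first 30 to get the top 30
feature_names[idf_order][:30]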

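If you want the features sorted by tfidf score for each document, you can do something similar.

# Per-document feature order by tfidf score, highest first.
# argsort works within each row, so reverse the columns ([:, ::-1]), not the rows.
# Note: toarray() builds a dense matrix, which can be large for a big corpus.
tfidf_order = tfidf.toarray().argsort(axis=1)[:, ::-1]

# Row i lists the feature names for document i, highest score first
feature_names[tfidf_order]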

tf-idf

scikit-learn