Check the tf-idf scores of sklearn in python

Question

I followed the example here to calculate TF-IDF values using sklearn.

My code is as follows:

from sklearn.feature_extraction.text import TfidfVectorizer
myvocabulary = ['life', 'learning']
corpus = {1: "The game of life is a game of everlasting learning", 2: "The unexamined life is not worth living", 3: "Never stop learning"}
tfidf = TfidfVectorizer(vocabulary = myvocabulary, ngram_range = (1,3))
tfs = tfidf.fit_transform(corpus.values())

I want to calculate the tf-idf values of the two words life and learning for the 3 documents in corpus.

According to the article I am referring to (see the Table below), my example should give the following values.

However, the values I get from my code are completely different. Please help me find what is wrong in my code and how to fix it.

Answer 1

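The main point is that you should not restrict the vocabulary to only two words ('life', 'learning') before building the term frequency matrix. If you do, all other words are ignored, which changes the term frequency counts.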
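To illustrate the effect, here is a minimal sketch that compares a vectorizer restricted to the two words with one that learns the full vocabulary. It assumes sklearn's default settings apart from vocabulary; the names docs, restricted and full are introduced here for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "The game of life is a game of everlasting learning",
    "The unexamined life is not worth living",
    "Never stop learning",
]

# Vocabulary fixed to two words: column 0 is 'life', column 1 is 'learning'.
restricted = TfidfVectorizer(vocabulary=["life", "learning"])
restricted_scores = restricted.fit_transform(docs).toarray()

# Vocabulary learned from the corpus: every token gets its own column.
full = TfidfVectorizer()
full_scores = full.fit_transform(docs).toarray()
life_col = full.vocabulary_["life"]

# In the second document 'life' is the only word left in the restricted
# vocabulary, so the default L2 normalization gives it all of the weight.
print("restricted:", restricted_scores[1, 0])
print("full:      ", full_scores[1, life_col])
```

With the restricted vocabulary the score of life in the second document is driven purely by the normalization, because the other words of that document are no longer counted.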

If you want to get exactly the same numbers as the example using sklearn, there are a few other steps to consider:

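The features in the example are unigrams (single words), so I set ngram_range=(1,1).
The example uses a different normalization for the term frequency part than sklearn does (term counts are normalized by document length in the example, whereas sklearn uses raw term counts by default). For that reason, I computed and normalized the term frequencies separately, before computing the idf part.
The normalization of the idf part in the example is not sklearn's default either. This can be adjusted to match the example by setting smooth_idf to False.
Sklearn's vectorizer discards one-character words by default, but such words are kept in the example. In the code below, I modified token_pattern to also allow 1-character words.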
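The last two points can be checked in isolation. The sketch below assumes the same three documents and compares sklearn's default settings with the adjusted ones; the variable names are illustrative.

```python
import math

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "The game of life is a game of everlasting learning",
    "The unexamined life is not worth living",
    "Never stop learning",
]

# The default token_pattern requires at least two word characters,
# so the one-character word 'a' is dropped unless the pattern is relaxed.
default_vocab = CountVectorizer().fit(docs).vocabulary_
one_char_vocab = CountVectorizer(token_pattern=r"(?u)\b\w+\b").fit(docs).vocabulary_
print("a" in default_vocab, "a" in one_char_vocab)

# 'life' occurs in 2 of the 3 documents.
smoothed = TfidfVectorizer().fit(docs)
unsmoothed = TfidfVectorizer(smooth_idf=False).fit(docs)
life = smoothed.vocabulary_["life"]

# smooth_idf=True:  idf = ln((1 + n) / (1 + df)) + 1
# smooth_idf=False: idf = ln(n / df) + 1
print(smoothed.idf_[life], math.log((1 + 3) / (1 + 2)) + 1)
print(unsmoothed.idf_[life], math.log(3 / 2) + 1)
```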

The final tfidf matrix is obtained by multiplying the normalized counts by the idf vector.

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.preprocessing import normalize
import pandas as pd

corpus = {1: "The game of life is a game of everlasting learning", 2: "The unexamined life is not worth living", 3: "Never stop learning"}

# Raw string so that \b is a regex word boundary; \w+ also keeps 1-character words.
token_pattern = r'(?u)\b\w+\b'

# Term frequencies: raw counts divided by document length (L1 normalization per row).
cvect = CountVectorizer(ngram_range=(1,1), token_pattern=token_pattern)
counts = cvect.fit_transform(corpus.values())
normalized_counts = normalize(counts, norm='l1', axis=1)

# Fitted only to obtain the unsmoothed idf vector (idf_).
tfidf = TfidfVectorizer(ngram_range=(1,1), token_pattern=token_pattern, smooth_idf=False)
tfs = tfidf.fit_transform(corpus.values())
new_tfs = normalized_counts.multiply(tfidf.idf_)

# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2.
feature_names = tfidf.get_feature_names_out()
corpus_index = [n for n in corpus]
df = pd.DataFrame(new_tfs.T.toarray(), index=feature_names, columns=corpus_index)

print(df.loc[['life', 'learning']])
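As a sanity check, the same quantities can be recomputed without sklearn. This is only a sketch under stated assumptions: tokens are obtained by lowercasing and splitting on whitespace (enough for this punctuation-free corpus), tf is the term count divided by the document length, and idf is 1 + ln(N / df), which is what sklearn computes with smooth_idf=False. The helper names tf and idf are introduced here for illustration.

```python
import math

docs = {
    1: "The game of life is a game of everlasting learning",
    2: "The unexamined life is not worth living",
    3: "Never stop learning",
}

# Assumption: whitespace tokenization, which keeps the one-character word 'a'.
tokenized = {key: text.lower().split() for key, text in docs.items()}
n_docs = len(tokenized)


def tf(term, tokens):
    """Term count normalized by document length."""
    return tokens.count(term) / len(tokens)


def idf(term):
    """Unsmoothed idf: 1 + ln(N / df)."""
    doc_freq = sum(term in tokens for tokens in tokenized.values())
    return 1 + math.log(n_docs / doc_freq)


scores = {
    term: {key: tf(term, tokens) * idf(term) for key, tokens in tokenized.items()}
    for term in ("life", "learning")
}

for term, per_doc in scores.items():
    print(term, per_doc)
```

For example, life appears once among the 10 tokens of document 1 and in 2 of the 3 documents, so its score there is 0.1 * (1 + ln(3/2)).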

In practice, however, such modifications are rarely needed. Using TfidfVectorizer directly usually gives good results.
