Is there a way to set min_df and max_df in gensim's tfidf model?

Question

I use gensim's tfidf model like this:

from gensim import corpora, models

dictionary = corpora.Dictionary(some_corpus)
mapped_corpus = [dictionary.doc2bow(text)
                 for text in some_corpus]

tfidf = models.TfidfModel(mapped_corpus)

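Now I would like to apply thresholds to remove terms that appear too frequently (max_df) or too infrequently (min_df). I know that scikit-learn's CountVectorizer lets you do this, but I can't find how to set these thresholds in gensim's tfidf. Could someone please help?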

Answer 1

You can filter the dictionary using

dictionary.filter_extremes(no_below=min_df, no_above=rel_max_df)

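Note that no_below expects the minimum number of documents a token must appear in (an absolute count), whereas no_above expects a maximum relative frequency, i.e. a fraction of the total number of documents, e.g. 0.5. Afterwards, you can build the corpus with the filtered dictionary. According to the gensim docs, it is also possible to construct a TfidfModel from the dictionary alone.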

tf-idf

gensim