R: weightTf and weightTfIdf yield the same frequent word list?

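I realized today that tf and/or tfidf seems to be broken in R. See the example below. It uses the data from the manual, i.e. crude. I expected the two resulting frequent-term lists to differ, but they are equal. That should never happen, right?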

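library(tm)
data("crude")

# Pass weightTf directly: function(x) weightTf would return the function itself, not a weighted matrix
dtm <- DocumentTermMatrix(crude, control = list(weighting = weightTf, stopwords = FALSE))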
frequentTerms1 <- data.frame(findFreqTerms(dtm)[1:1000])
#View(frequentTerms1)


dtm <- DocumentTermMatrix(crude, control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE), stopwords = FALSE))
frequentTerms2 <- data.frame(findFreqTerms(dtm)[1:1000])
#View(frequentTerms2)

frequentTerms1 == frequentTerms2

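Is there a mistake in my example code? I copied it from the manual of the underlying tm package and added a tf case plus the comparison.

Thanks for any suggestions.

Best regards, Thorsten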


Edit #1: OK, thanks lukeA for the answer. That helps a lot. So the "right" way to get the frequent terms is:

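library(tm)
data("crude")

# Pass weightTf directly: function(x) weightTf would return the function itself, not a weighted matrix
dtm <- DocumentTermMatrix(crude, control = list(weighting = weightTf, stopwords = FALSE))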
frequentTerms1 <- as.data.frame(sort(colSums(as.matrix(dtm)), decreasing = TRUE))
#View(frequentTerms1)


dtm <- DocumentTermMatrix(crude, control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE), stopwords = FALSE))
frequentTerms2 <- as.data.frame(sort(colSums(as.matrix(dtm)), decreasing = TRUE))
#View(frequentTerms2)

frequentTerms1 == frequentTerms2

Now the two lists differ.

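By default, findFreqTerms checks whether the row sums of the transposed document-term matrix (= the term-document matrix) are greater than or equal to 0 and less than or equal to infinity. That is true for every term, under both term-frequency and tf-idf weighting, so both calls return the full vocabulary. Here is an example: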

library(tm)
txts <- c("Hello super World", "Hello World World")
corp <- VCorpus(VectorSource(txts))
tf <- DocumentTermMatrix(corp, control=list(weighting=weightTf))
tfidf <- DocumentTermMatrix(corp, control=list(weighting=weightTfIdf))

all(findFreqTerms(tf)==findFreqTerms(tfidf))
# [1] TRUE
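
To make the default behaviour concrete, here is a minimal base-R sketch of the filtering step described above. The helper name freq_terms_sketch is hypothetical and only mimics the idea (keep terms whose total lies between lowfreq and highfreq); it works on a plain docs-by-terms matrix rather than on a tm object, and it is not the package's actual implementation.

```r
# Hypothetical stand-in for the filtering idea behind findFreqTerms(),
# operating on a plain docs x terms matrix instead of a tm object.
freq_terms_sketch <- function(m, lowfreq = 0, highfreq = Inf) {
  totals <- colSums(m)  # per-term totals (= row sums of the transposed matrix)
  names(totals)[totals >= lowfreq & totals <= highfreq]
}

# Term counts for "Hello super World" and "Hello World World"
tf_counts <- matrix(c(1, 1, 1,
                      1, 0, 2),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(Docs = c("1", "2"),
                                    Terms = c("hello", "super", "world")))

freq_terms_sketch(tf_counts)               # default bounds keep every term
freq_terms_sketch(tf_counts, lowfreq = 2)  # drops "super", whose total is 1
```

With bounds of 0 and Inf no non-negative total can ever be excluded, which is why the weighting makes no difference until a threshold is set.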

Now, if you specify a different minimum frequency:

findFreqTerms(tf, lowfreq = 1)
# [1] "hello" "super" "world"
findFreqTerms(tfidf, lowfreq = 0.33)
# [1] "super"

as.matrix(tf)
#     Terms
# Docs hello super world
#    1     1     1     1
#    2     1     0     2

as.matrix(tfidf)
#     Terms
# Docs hello     super world
#    1     0 0.3333333     0
#    2     0 0.0000000     0
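
The zeros for "hello" and "world" can be reproduced by hand. The sketch below recomputes the tf-idf matrix in base R, assuming the weighting is the length-normalized term frequency multiplied by log2(number of documents / document frequency); the variable names are illustrative only.

```r
# Term counts for "Hello super World" and "Hello World World"
tf_counts <- matrix(c(1, 1, 1,
                      1, 0, 2),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(Docs = c("1", "2"),
                                    Terms = c("hello", "super", "world")))

# Normalized term frequency: each count divided by its document's length
tf_norm <- tf_counts / rowSums(tf_counts)

# Document frequency: in how many documents each term occurs
doc_freq <- colSums(tf_counts > 0)

# Inverse document frequency (assumed form): log2(N / df)
idf <- log2(nrow(tf_counts) / doc_freq)

# Scale every term column by its idf
tfidf <- sweep(tf_norm, 2, idf, `*`)

tfidf
colSums(tfidf)  # the per-term totals a frequency threshold is compared against
```

"hello" and "world" occur in both documents, so their idf is log2(2/2) = 0 and their tf-idf weights vanish. Only "super" keeps a positive weight (1/3), which is why it alone passes a lowfreq of 0.33.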