R: weightTf and weightTfIdf yield the same frequent word list?

Question

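I realized today that tf and/or tfidf seem to be broken in R. See my example below. It uses the data from the manual, i.e. crude. I would expect the two resulting lists of frequent terms to differ, but they are equal. That should never happen, right?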

library(tm)
data("crude")

# Plain term-frequency weighting: pass the weighting function itself
dtm <- DocumentTermMatrix(crude, control = list(weighting = weightTf, stopwords = FALSE))
frequentTerms1 <- data.frame(findFreqTerms(dtm)[1:1000])
#View(frequentTerms1)


# tf-idf weighting without normalization
dtm <- DocumentTermMatrix(crude, control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE), stopwords = FALSE))
frequentTerms2 <- data.frame(findFreqTerms(dtm)[1:1000])
#View(frequentTerms2)

# Element-wise comparison of the two term lists
frequentTerms1 == frequentTerms2

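Is there an error in my example code? I copied it from the manual of the underlying tm package and added a tf case plus the comparison.

Thanks for any suggestions.

Regards, Thorsten

Edit #1: OK, thanks lukeA for your answer. That helps a lot. So the "right" way to get the frequent terms is: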

library(tm)
data("crude")

# Term-frequency weighting; rank terms by their total weight across documents
dtm <- DocumentTermMatrix(crude, control = list(weighting = weightTf, stopwords = FALSE))
frequentTerms1 <- as.data.frame(sort(colSums(as.matrix(dtm)), decreasing = TRUE))
#View(frequentTerms1)


# tf-idf weighting without normalization, ranked the same way
dtm <- DocumentTermMatrix(crude, control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE), stopwords = FALSE))
frequentTerms2 <- as.data.frame(sort(colSums(as.matrix(dtm)), decreasing = TRUE))
#View(frequentTerms2)

frequentTerms1 == frequentTerms2

Now the two lists differ.

Answer 1

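By default, findFreqTerms checks whether the row sums of the transposed document-term matrix (= the term-document matrix) are greater than or equal to 0 and less than or equal to infinity. That is true for every term, with term-frequency weighting as well as with tf-idf weighting. Here is an example: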

library(tm)

txts <- c("Hello super World", "Hello World World")
corp <- VCorpus(VectorSource(txts))
# Same corpus, once with raw term frequencies and once with tf-idf weights
tf <- DocumentTermMatrix(corp, control=list(weighting=weightTf))
tfidf <- DocumentTermMatrix(corp, control=list(weighting=weightTfIdf))

all(findFreqTerms(tf)==findFreqTerms(tfidf))
# [1] TRUE
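To make the default bounds concrete, here is a minimal base-R sketch of the row-sum check described above. The helper `freq_terms_sketch` is a hypothetical stand-in written for illustration, not tm's actual source, and the two toy matrices simply hard-code the weights that this two-document corpus produces.

```r
# Toy term-document matrices (terms in rows), hard-coded for the
# two-document example corpus. No tm objects are involved here.
dims <- list(Terms = c("hello", "super", "world"), Docs = c("1", "2"))
tf_tdm    <- matrix(c(1, 1,   1,     0,   1, 2), nrow = 3, byrow = TRUE, dimnames = dims)
tfidf_tdm <- matrix(c(0, 0,   1 / 3, 0,   0, 0), nrow = 3, byrow = TRUE, dimnames = dims)

# Hypothetical helper (an assumption, not tm's implementation):
# keep the terms whose row sum lies within [lowfreq, highfreq].
freq_terms_sketch <- function(tdm, lowfreq = 0, highfreq = Inf) {
  rs <- rowSums(tdm)
  names(rs)[rs >= lowfreq & rs <= highfreq]
}

# With the default bounds every term passes, whatever the weighting.
default_tf    <- freq_terms_sketch(tf_tdm)
default_tfidf <- freq_terms_sketch(tfidf_tdm)
identical(default_tf, default_tfidf)

# A positive lower bound is what separates the two weightings.
filtered_tfidf <- freq_terms_sketch(tfidf_tdm, lowfreq = 0.33)
```

Because every row sum is at least 0, the default call cannot distinguish the two weightings; only an explicit `lowfreq` (or `highfreq`) makes the result depend on the weights.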

Now, if you specify a different minimum frequency:

findFreqTerms(tf, lowfreq = 1)
# [1] "hello" "super" "world"
findFreqTerms(tfidf, lowfreq = 0.33)
# [1] "super"

with

as.matrix(tf)
#     Terms
# Docs hello super world
#    1     1     1     1
#    2     1     0     2

as.matrix(tfidf)
#     Terms
# Docs hello     super world
#    1     0 0.3333333     0
#    2     0 0.0000000     0
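The 0.3333333 for "super" can be reproduced by hand. The following base-R sketch assumes what the weightTfIdf documentation describes for the default normalize = TRUE: term counts are divided by the document length, and the inverse document frequency is log2 of the number of documents over the number of documents containing the term. The variable names are illustrative only.

```r
# Raw counts for the two example documents (documents in rows).
counts <- matrix(c(1, 1, 1,
                   1, 0, 2),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Docs = c("1", "2"),
                                 Terms = c("hello", "super", "world")))

# Normalized term frequency: each count divided by its document's length.
tf_norm <- counts / rowSums(counts)

# Inverse document frequency: log2(number of docs / docs containing the term).
doc_freq <- colSums(counts > 0)
idf <- log2(nrow(counts) / doc_freq)

# tf-idf: scale every term column by that term's idf.
tfidf_manual <- sweep(tf_norm, 2, idf, `*`)
tfidf_manual
```

"hello" and "world" occur in both documents, so their idf is log2(2/2) = 0 and their weights vanish. "super" occurs only in document 1, so its idf is log2(2/1) = 1 and its weight there is 1/3.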
