R: weightTf and weightTfIdf yield the same frequent word list?
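I realized today that tf and/or tfidf seem to be broken in R. See the example below. It uses the data from the manual, i.e. crude. I expected the two resulting frequent-term lists to differ, but they are equal. That should never happen, right?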
library(tm)
data("crude")
# Term-frequency weighting: pass the weighting function itself
dtm <- DocumentTermMatrix(crude, control = list(weighting = weightTf, stopwords = FALSE))
frequentTerms1 <- data.frame(findFreqTerms(dtm)[1:1000])
#View(frequentTerms1)
# Unnormalized tf-idf weighting
dtm <- DocumentTermMatrix(crude, control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE), stopwords = FALSE))
frequentTerms2 <- data.frame(findFreqTerms(dtm)[1:1000])
#View(frequentTerms2)
frequentTerms1 == frequentTerms2
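Is there a mistake in my example code? I copied it from the manual of the underlying tm package and added a tf case plus the comparison.
Thanks for any advice.
Best regards,
Torsten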
Edit #1:
OK, thanks to lukeA for the answer. That helped a lot. So the "right" way to get the frequent terms is:
library(tm)
data("crude")
# Term-frequency weighting: pass the weighting function itself
dtm <- DocumentTermMatrix(crude, control = list(weighting = weightTf, stopwords = FALSE))
# Sum each term's weight over all documents and sort, highest first
frequentTerms1 <- as.data.frame(sort(colSums(as.matrix(dtm)), decreasing = TRUE))
#View(frequentTerms1)
# Unnormalized tf-idf weighting
dtm <- DocumentTermMatrix(crude, control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE), stopwords = FALSE))
frequentTerms2 <- as.data.frame(sort(colSums(as.matrix(dtm)), decreasing = TRUE))
#View(frequentTerms2)
frequentTerms1 == frequentTerms2
Now the two lists differ.
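One caveat about the comparison above: `==` on the two data frames compares the summed weights position by position, not the term names. Here is a minimal sketch of comparing the rankings by name instead. It uses base R only and the per-term totals of a two-document toy corpus ("Hello super World", "Hello World World"); the variable names are illustrative assumptions, not part of tm.

```r
# Per-term totals (column sums) for the toy corpus under each weighting.
# The tf-idf totals assume length-normalized tf and a log2 idf.
tf_totals    <- c(hello = 2, super = 1, world = 3)
tfidf_totals <- c(hello = 0, super = 1/3, world = 0)

# Rank terms by total weight, highest first, and keep only the names
rank_tf    <- names(sort(tf_totals, decreasing = TRUE))
rank_tfidf <- names(sort(tfidf_totals, decreasing = TRUE))

# Do both weightings put the same term on top?
rank_tf[1] == rank_tfidf[1]

# Are the full rankings identical?
identical(rank_tf, rank_tfidf)
```

With real data the same idea would apply to `rownames(frequentTerms1)` and `rownames(frequentTerms2)`.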
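By default, findFreqTerms checks whether the row sums of the transposed document-term matrix (= the term-document matrix) are greater than or equal to 0 and less than or equal to infinity. That holds for every term under both term-frequency and tf-idf weighting, so both calls return all terms. Here is an example: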
library(tm)
txts <- c("Hello super World", "Hello World World")
corp <- VCorpus(VectorSource(txts))
tf <- DocumentTermMatrix(corp, control=list(weighting=weightTf))
tfidf <- DocumentTermMatrix(corp, control=list(weighting=weightTfIdf))
all(findFreqTerms(tf)==findFreqTerms(tfidf))
# [1] TRUE
Now, if you specify a different lower frequency bound:
findFreqTerms(tf, lowfreq = 1)
# [1] "hello" "super" "world"
findFreqTerms(tfidf, lowfreq = 0.33)
# [1] "super"
given these matrices:
as.matrix(tf)
#     Terms
# Docs hello super world
#    1     1     1     1
#    2     1     0     2
as.matrix(tfidf)
#     Terms
# Docs hello     super world
#    1     0 0.3333333     0
#    2     0 0.0000000     0
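To see where the 0.3333333 comes from, here is a rough base-R sketch that rebuilds the two matrices by hand and applies the same kind of threshold filter. It assumes tf-idf is computed as length-normalized term frequency times log2(number of documents / documents containing the term), which reproduces the values shown above. `freq_terms` is a hypothetical helper written for illustration, not the tm implementation.

```r
# Toy term counts, matching as.matrix(tf) above (rows = documents, columns = terms)
counts <- matrix(c(1, 1, 1,
                   1, 0, 2),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Docs = c("1", "2"),
                                 Terms = c("hello", "super", "world")))

# Normalized term frequency: each count divided by its document's length
tf_norm <- counts / rowSums(counts)

# Inverse document frequency: log2(number of docs / docs containing the term)
idf <- log2(nrow(counts) / colSums(counts > 0))

# tf-idf: scale every term column by its idf
tfidf_manual <- sweep(tf_norm, 2, idf, `*`)

# Hypothetical helper: keep terms whose total weight lies in [lowfreq, highfreq]
freq_terms <- function(m, lowfreq = 0, highfreq = Inf) {
  totals <- colSums(m)
  names(totals)[totals >= lowfreq & totals <= highfreq]
}

freq_terms(counts)                         # default bounds: every term passes
freq_terms(tfidf_manual)                   # default bounds: every term passes again
freq_terms(tfidf_manual, lowfreq = 0.33)   # only "super" has a total of at least 0.33
```

"hello" and "world" occur in both documents, so their idf is log2(2/2) = 0 and their tf-idf weight vanishes. Only "super" keeps a positive weight (1/3 * log2(2/1)).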