Find Frequent Word and its Value in Document Term Frequency
I need to find the most frequently occurring words and their frequency values from a DTM.
library('tm')
library("SnowballC")
my.text.location "C:/Users/mrina/OneDrive/Documents/../"
apapers <- VCorpus(DirSource(my.text.location)) class(apapers)
apapers <- tm_map(apapers, removeNumbers)
apapers <- tm_map(apapers, removePunctuation)
apapers <- tm_map(apapers, stemDocument, language ="en")
This cleans up the corpus; the code below creates the DTM and finds the frequent terms.
ptm.tf <- DocumentTermMatrix(apapers)
dim(ptm.tf)
findFreqTerms(ptm.tf)
Is there a way to get the frequent words together with their frequency values?
If you don't mind using another package, this should work (instead of creating a DTM object):
library('tm')
library("SnowballC")
my.text.location "C:/Users/mrina/OneDrive/Documents/../"
apapers <- VCorpus(DirSource(my.text.location))
class(apapers)
apapers <- tm_map(apapers, removeNumbers)
apapers <- tm_map(apapers, removePunctuation)
apapers <- tm_map(apapers, stemDocument, language ="en")
# new lines here
library(qdap)
freq_terms(apapers)
Created on 2018-09-28 by the reprex package (v0.2.0).
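A quick usage sketch (assuming freq_terms accepts the corpus as shown above; the top argument caps how many terms are returned):

library(qdap)
ft <- freq_terms(apapers, top = 20)  # 20 most frequent terms with their counts
ft                                   # prints a word / frequency table
head(ft)                             # or inspect just the first few rows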
findFreqTerms is nothing more than taking row sums over a sparse matrix; internally it uses row_sums from slam. To keep the counts attached to the words we can use the same functions. The slam package is installed along with tm, so you can either load slam or call its functions via slam::. Using the slam functions is preferable because they operate on the sparse matrix directly; base rowSums would first convert the sparse matrix into a dense one, which is slower and uses more memory.
# your code.....
ptm.tf <- DocumentTermMatrix(apapers)
# using col_sums since it is a document term matrix. If it is a term document matrix use row_sums
frequency <- slam::col_sums(ptm.tf)
# Filtering like findFreqTerms. Find words that occur 10 times or more.
frequency <- frequency[frequency >= 10]
# turn into data.frame if needed:
frequency_df <- data.frame(words = names(frequency), freq = frequency, row.names = NULL)
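To see the most frequent terms first, the resulting data frame can be sorted (a usage example, not part of the original answer):

# Sort by count, highest first, and show the ten most frequent terms
head(frequency_df[order(frequency_df$freq, decreasing = TRUE), ], 10)
# The single most frequent term and its count
frequency[which.max(frequency)]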