Extract ngrams with R
I am trying to extract 3-grams from the nirvana text using the ngramrr package.
require(ngramrr)
require(tm)
require(magrittr)
nirvana <- c("hello hello hello how low", "hello hello hello how low",
"hello hello hello how low", "hello hello hello",
"with the lights out", "it's less dangerous", "here we are now",
"entertain us", "i feel stupid", "and contagious", "here we are now",
"entertain us", "a mulatto", "an albino", "a mosquito", "my libido",
"yeah", "hey yay")
ngramrr(nirvana[1], ngmax = 3)
Corpus(VectorSource(nirvana))
I get this result:
[1] "hello" "hello" "hello" "how" "low" "hello hello" "hello hello"
[8] "hello how" "how low" "hello hello hello" "hello hello how" "hello how low"
I would like to know how to build a TermDocumentMatrix whose terms are the list of tri-grams.
Thanks
My comment above was nearly a complete answer, but here it is in full, using quanteda:
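library(quanteda) # tokens(), tokens_ngrams(), dfm(), convert()
library(magrittr) # %>%
nirvana %>% tokens() %>% # generate word tokens
tokens_ngrams(n = 1:3, concatenator = " ") %>% # 1- to 3-grams (older quanteda releases: tokens(ngrams = 1:3))
dfm %>% # generate dfm
convert(to = "tm") %>% # convert to tm's document-term-matrix
t # transpose it to term-document-matrix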
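To make it concrete what a term-document matrix restricted to tri-grams contains, here is a minimal base R sketch that uses no packages. It is only an illustration, not how tm or quanteda build their matrices: `word_ngrams` is a hypothetical helper written for this example, the four documents are a small subset of the lyrics above, and the result is a plain integer matrix rather than a tm `TermDocumentMatrix` object.

```r
# Hypothetical helper (not part of ngramrr, tm, or quanteda):
# returns the word n-grams of a single string, split on whitespace.
word_ngrams <- function(text, n = 3) {
  words <- strsplit(text, "\\s+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))  # too short for any n-gram
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# A small subset of the lyrics, chosen for illustration
docs <- c("hello hello hello how low", "hello hello hello",
          "with the lights out", "yeah")

grams <- lapply(docs, word_ngrams, n = 3)   # tri-grams per document
terms <- sort(unique(unlist(grams)))        # vocabulary of tri-grams

# Rows are tri-grams, columns are documents, cells are counts
tdm <- matrix(0L, nrow = length(terms), ncol = length(docs),
              dimnames = list(Terms = terms,
                              Docs = as.character(seq_along(docs))))
for (j in seq_along(grams)) {
  if (length(grams[[j]]) == 0) next         # document has no tri-grams
  counts <- table(grams[[j]])
  tdm[names(counts), j] <- as.integer(counts)
}

print(tdm)
```

Documents shorter than three words, such as "yeah", contribute an all-zero column. With a package-based pipeline the same restriction to tri-grams would come from asking for n-grams of size 3 only instead of sizes 1 to 3.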