Extract ngrams with R

Question

I'm trying to extract 3-grams from the nirvana text using the ngramrr package.

require(ngramrr)
require(tm)
require(magrittr)

nirvana <- c("hello hello hello how low", "hello hello hello how low",
             "hello hello hello how low", "hello hello hello",
             "with the lights out", "it's less dangerous", "here we are now",
             "entertain us", "i feel stupid", "and contagious", "here we are now", 
             "entertain us", "a mulatto", "an albino", "a mosquito", "my libido",
             "yeah", "hey yay")

ngramrr(nirvana[1], ngmax = 3)

Corpus(VectorSource(nirvana))

I get this result:

[1] "hello"      "hello"    "hello"              "how"  "low"       "hello hello"  "hello hello"      
[8] "hello how"  "how low"  "hello hello hello"  "hello hello how"  "hello how low"

I would like to know how to build a TermDocumentMatrix whose terms are the tri-grams.

Thanks

Answer 1

My comment above was almost a complete answer, but here it is in full, using the quanteda package:

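require(quanteda) # tokens(), tokens_ngrams(), dfm(), convert()

nirvana %>% tokens() %>% # generate tokens
  tokens_ngrams(n = 1:3) %>% # 1- to 3-grams; use n = 3 for tri-grams only
  dfm() %>% # generate dfm
  convert(to = "tm") %>% # convert to tm's document-term-matrix
  t() # transpose it to term-document-matrix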
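
If you would rather stay within tm, one possible approach is to pass a custom tokenizer to TermDocumentMatrix through its control list. The sketch below is only an illustration under a few assumptions: `trigram_tokenizer` and the `lyrics` sample are hypothetical names introduced here, the tokenizer splits on whitespace only, and the corpus is built with VCorpus, since tm's SimpleCorpus (which `Corpus()` may return) does not honour custom tokenizers.

```r
# Hypothetical helper (not part of tm or ngramrr): whitespace-based tri-gram tokenizer.
trigram_tokenizer <- function(x, n = 3) {
  # as.character() works for plain strings and for tm's PlainTextDocument
  words <- unlist(strsplit(paste(as.character(x), collapse = " "), "\\s+"))
  words <- words[nzchar(words)]
  # documents shorter than n words contribute no n-grams
  if (length(words) < n) return(character(0))
  vapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Small sample standing in for the full nirvana vector
lyrics <- c("hello hello hello how low", "with the lights out", "entertain us")

trigram_tokenizer(lyrics[1])
# [1] "hello hello hello" "hello hello how"   "hello how low"

# Build the term-document matrix only if tm is installed
if (requireNamespace("tm", quietly = TRUE)) {
  corpus <- tm::VCorpus(tm::VectorSource(lyrics))
  tdm <- tm::TermDocumentMatrix(corpus,
                                control = list(tokenize = trigram_tokenizer))
  print(tdm)
}
```

The same tokenizer slot should also accept a wrapper around `ngramrr()` with `ngmin = 3` and `ngmax = 3`, though I have not checked how it behaves on documents with fewer than three words.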


r

text-mining