text2vec: Iterate over the vocabulary after using function create_vocabulary
Using the text2vec package, I created a vocabulary.
vocab = create_vocabulary(it_0, ngram = c(2L, 2L))
vocab looks like this:
> vocab
Number of docs: 120
0 stopwords: ...
ngram_min = 2; ngram_max = 2
Vocabulary:
terms terms_counts doc_counts
1: knight_severely 1 1
2: movie_expect 1 1
3: recommend_watching 1 1
4: nuke_entire 1 1
5: sense_keeping 1 1
---
14467: stand_idly 1 1
14468: officer_loyalty 1 1
14469: willingness_die 1 1
14470: fight_bane 3 3
14471: bane_beginning 1 1
How can I see the range of the terms_counts column? I need it because it will help me choose a threshold for my next step, pruning:
pruned_vocab = prune_vocabulary(vocab, term_count_min = <BLANK>)
The following code is a reproducible example:
library(text2vec)
text <- c(" huge fan superhero movies expectations batman begins viewing christopher
nolan production pleasantly shocked huge expectations dark knight christopher
nolan blew expectations dust happen film dark knight rises simply big expectations
blown production true cinematic experience behold movie exceeded expectations terms
action entertainment",
"christopher nolan outdone morning tired awake set film films genuine emotional
eartbeat felt flaw nolan films vision emotion hollow bought felt hero villain
alike christian bale typically brilliant batman felt bruce wayne heavily embraced
final installment bale added emotional depth character plot point astray dark knight")
it_0 = itoken(text,
              tokenizer = word_tokenizer,
              progressbar = TRUE)
vocab = create_vocabulary(it_0, ngram = c(2L, 2L))
vocab
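Try range(vocab$vocab$terms_counts).
vocab is a list made up of some meta-information (number of documents, ngram size, etc.) and the main data.frame/data.table, which holds the term counts and the document counts for each term.
As mentioned, vocab$vocab is what you need (the data.table with the counts).
You can inspect the internal structure by calling str(vocab):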
List of 5
$ vocab :Classes ‘data.table’ and 'data.frame': 82 obs. of 3 variables:
..$ terms : chr [1:82] "plot_point" "depth_character" "emotional_depth" "bale_added" ...
..$ terms_counts: int [1:82] 1 1 1 1 1 1 1 1 1 1 ...
..$ doc_counts : int [1:82] 1 1 1 1 1 1 1 1 1 1 ...
..- attr(*, ".internal.selfref")=<externalptr>
$ ngram : Named int [1:2] 2 2
..- attr(*, "names")= chr [1:2] "ngram_min" "ngram_max"
$ document_count: int 2
$ stopwords : chr(0)
$ sep_ngram : chr "_"
- attr(*, "class")= chr "text2vec_vocabulary"
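To make the link between the range and the pruning threshold concrete, here is a minimal base-R sketch. It does not call text2vec: `vocab_demo` is a hypothetical stand-in that only mirrors the list layout shown by str(vocab) above, and its terms and counts are invented for illustration. Other text2vec versions may lay the object out differently, so check str(vocab) on your own object first.

```r
# Hypothetical stand-in for the object returned by create_vocabulary().
# The layout mirrors the str() output above; the values are invented.
vocab_demo <- list(
  vocab = data.frame(
    terms        = c("plot_point", "dark_knight",
                     "christopher_nolan", "huge_expectations"),
    terms_counts = c(1L, 3L, 2L, 1L),
    doc_counts   = c(1L, 2L, 2L, 1L),
    stringsAsFactors = FALSE
  ),
  document_count = 2L
)

counts <- vocab_demo$vocab$terms_counts

# Smallest and largest term count in the vocabulary
count_range <- range(counts)

# How many terms occur once, twice, ... (useful for picking a cutoff)
count_table <- table(counts)

# Base-R illustration of what a minimum-count threshold keeps;
# min_count plays the role of term_count_min in the question
min_count <- 2L
kept <- vocab_demo$vocab[counts >= min_count, ]

print(count_range)
print(count_table)
print(kept$terms)
```

On a real vocabulary, the same range() and table() calls applied to vocab$vocab$terms_counts should show how many terms a given term_count_min would drop before you pass it to prune_vocabulary.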