How to get a list of the types of stopwords removed from dataset using QUANTEDA, R
I am working with a text dataset using quanteda in R. I created a corpus from that dataset, and then built a dfm with all English punctuation and stopwords removed, using:
dfm_nostp <- dfm(data, remove_punct = TRUE, remove=c(stopwords("english")))
Is there a way to check how many types of punctuation marks and stopwords I removed from the dataset in quanteda?
Many thanks.
Try this:
library("quanteda")
## Package version: 1.5.2
summarize_texts_extended <- function(x, stop_words = stopwords("en")) {
  toks <- tokens(x) %>%
    tokens_tolower()
  # total tokens
  ndocs <- ndoc(x)
  ntoksall <- ntoken(toks)
  ntoks <- sum(ntoksall)
  # punctuation
  toks <- tokens(toks, remove_punct = TRUE, remove_symbols = FALSE)
  npunct <- ntoks - sum(ntoken(toks))
  # symbols and emoji
  toks <- tokens(toks, remove_symbols = TRUE)
  nsym <- ntoks - npunct - sum(ntoken(toks))
  # numbers
  toks <- tokens(toks, remove_numbers = TRUE)
  nnumbers <- ntoks - npunct - nsym - sum(ntoken(toks))
  # words
  nwords <- ntoks - npunct - nsym - nnumbers
  # stopwords
  dfmat <- dfm(toks)
  nfeats <- nfeat(dfmat)
  dfmat <- dfm_remove(dfmat, stop_words)
  nstopwords <- nfeats - nfeat(dfmat)
  list(
    total_tokens = ntoks,
    total_punctuation = npunct,
    total_symbols = nsym,
    total_numbers = nnumbers,
    total_words = nwords,
    total_stopwords = nstopwords
  )
}
It returns the quantities you want, as a list:
summarize_texts_extended(data_corpus_inaugural)
## $total_tokens
## [1] 149138
##
## $total_punctuation
## [1] 13852
##
## $total_symbols
## [1] 4
##
## $total_numbers
## [1] 85
##
## $total_words
## [1] 135197
##
## $total_stopwords
## [1] 136
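If you also want to see *which* stopword types were removed, not just how many, one possible approach (a sketch, using the built-in data_corpus_inaugural for illustration) is to intersect the dfm's feature names with the stopword list:

```r
library("quanteda")

# Build a dfm (dfm() lowercases features by default), then intersect its
# feature names with the stopword list: the result is exactly the set of
# stopword types that dfm_remove(dfmat, stopwords("en")) would drop.
dfmat <- dfm(tokens(data_corpus_inaugural, remove_punct = TRUE))
removed_types <- intersect(featnames(dfmat), stopwords("en"))

head(removed_types)    # the first few stopword types present in the corpus
length(removed_types)  # how many distinct stopword types were removed
```

The same idea works for punctuation: tokenize once with and once without remove_punct = TRUE, and take setdiff() of the two sets of types.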