How to get a list of the types of stopwords removed from dataset using QUANTEDA, R
I am working with a text dataset using quanteda in R. I created a corpus from that dataset, and then built a dfm with all English punctuation and stopwords removed, using:
dfm_nostp <- dfm(data, remove_punct = TRUE, remove=c(stopwords("english")))
Is there a way to check how many types of punctuation marks and stopwords I removed from the dataset in quanteda?
Many thanks.
Try this:
library("quanteda")
## Package version: 1.5.2
summarize_texts_extended <- function(x, stop_words = stopwords("en")) {
  toks <- tokens(x) %>%
    tokens_tolower()
  # total tokens
  ndocs <- ndoc(x)
  ntoksall <- ntoken(toks)
  ntoks <- sum(ntoksall)
  # punctuation
  toks <- tokens(toks, remove_punct = TRUE, remove_symbols = FALSE)
  npunct <- ntoks - sum(ntoken(toks))
  # symbols and emoji
  toks <- tokens(toks, remove_symbols = TRUE)
  nsym <- ntoks - npunct - sum(ntoken(toks))
  # numbers
  toks <- tokens(toks, remove_numbers = TRUE)
  nnumbers <- ntoks - npunct - nsym - sum(ntoken(toks))
  # words
  nwords <- ntoks - npunct - nsym - nnumbers
  # stopwords
  dfmat <- dfm(toks)
  nfeats <- nfeat(dfmat)
  dfmat <- dfm_remove(dfmat, stop_words)
  nstopwords <- nfeats - nfeat(dfmat)
  list(
    total_tokens = ntoks,
    total_punctuation = npunct,
    total_symbols = nsym,
    total_numbers = nnumbers,
    total_words = nwords,
    total_stopwords = nstopwords
  )
}
It returns the quantities you want, as a list:
summarize_texts_extended(data_corpus_inaugural)
## $total_tokens
## [1] 149138
##
## $total_punctuation
## [1] 13852
##
## $total_symbols
## [1] 4
##
## $total_numbers
## [1] 85
##
## $total_words
## [1] 135197
##
## $total_stopwords
## [1] 136
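If you also want to see *which* stopword types were removed, not just how many, one possible approach (a sketch, using the built-in data_corpus_inaugural for illustration) is to intersect the dfm's feature names with the stopword list:

```r
library("quanteda")

# Build a dfm (dfm() lowercases features by default), then intersect its
# feature names with the stopword list: the result is exactly the set of
# stopword types that dfm_remove(dfmat, stopwords("en")) would drop.
dfmat <- dfm(tokens(data_corpus_inaugural, remove_punct = TRUE))
removed_types <- intersect(featnames(dfmat), stopwords("en"))

head(removed_types)    # the first few stopword types present in the corpus
length(removed_types)  # how many distinct stopword types were removed
```

The same idea works for punctuation: tokenize once with and once without remove_punct = TRUE, and take setdiff() of the two sets of types.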