How to remove different versions of stopwords

I remove stopwords from my text using this approach:

dfm <- 
    tokens(df$text,
           remove_punct = TRUE, 
           remove_numbers = TRUE, 
           remove_symbols = TRUE) %>%
    tokens_remove(pattern = stopwords(source = "smart")) %>%
      tokens_wordstem()

However, in the results I still find stopwords like:

dont

Is there a way to remove them without resorting to a custom stopword list?

The stopwords function itself cannot do this. However, you can easily build your own list from the "smart" dictionary and then drop the words you don't want:

library(dplyr)  # for filter() and %>%

my_stopwords <- data.frame(word = stopwords(source = "smart")) %>%
  filter(word != "dont")
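A minimal sketch of how the filtered list could then be plugged back into the pipeline from the question (assuming a data frame `df` with a `text` column, as above):

```r
library(quanteda)
library(dplyr)

# Build the custom list: the "smart" dictionary minus the unwanted entry
my_stopwords <- data.frame(word = stopwords(source = "smart")) %>%
  filter(word != "dont")

# Use the character column of the custom list as the removal pattern
dfm <-
  tokens(df$text,
         remove_punct = TRUE,
         remove_numbers = TRUE,
         remove_symbols = TRUE) %>%
  tokens_remove(pattern = my_stopwords$word) %>%
  tokens_wordstem()
```

`tokens_remove()` accepts any character vector as `pattern`, so passing `my_stopwords$word` in place of `stopwords(source = "smart")` is all that changes.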

There are a couple of packages and functions you can try for this. You seem comfortable with the tidyverse, so here is one solution along those lines.

Keep in mind that this is not a perfect approach. If you have very little text (short documents), you could simply manage it by hand and fix the errors yourself; if you don't know how many errors there are, or which ones, my solution may help.

library(quanteda) # for your purposes
library(qdap)     # to detect errors
library(tidytext) # lovely package for tidy texts
library(dplyr)    # for joins and data manipulation

Since you did not share your data, here is some fake data:

df <- data.frame(id = c(1:2),text = c("dont panic", "don't panic"), stringsAsFactors = F)
df
  id        text
1  1  dont panic
2  2 don't panic

Now, first we have to fix the errors:

unnested <- df %>% unnest_tokens(not.found, text)        # one row per word
errors <- data.frame(check_spelling(unnested$not.found)) # check for errors; this may take a while
full <- unnested %>% left_join(errors)                   # join them!

Here is the result:

full 
  id not.found row word.no suggestion                                more.suggestions
1  1      dont   1       1      don't donut, don, dot, docent, donate, donuts, dopant
2  1     panic  NA    <NA>       <NA>                                            NULL
3  2     don't  NA    <NA>       <NA>                                            NULL
4  2     panic  NA    <NA>       <NA>                                            NULL

Now it is easy to tidy it up:

full <- full %>%
  # if there is a suggested correction, replace the wrong word with it
  mutate(word = ifelse(is.na(suggestion), not.found, suggestion)) %>%
  # select useful columns
  select(id, word) %>%
  # group them and rebuild the texts
  group_by(id) %>%
  summarise(text = paste(word, collapse = ' '))

full 
# A tibble: 2 x 2
     id text       
  <int> <chr>      
1     1 don't panic
2     2 don't panic

Now you are ready to go:

tokens(as.character(full$text),
       remove_punct = TRUE, 
       remove_numbers = TRUE, 
       remove_symbols = TRUE) %>%
  tokens_remove(pattern = stopwords(source = "smart")) %>%
  tokens_wordstem()

tokens from 2 documents.
text1 :
[1] "panic"

text2 :
[1] "panic"

When you say "remove them", I assume you mean removing dont from your tokens, since the existing stopword list only removes don't. (Although this is not entirely clear from your question, or from how some of the answers interpret it.) Two simple solutions exist within the quanteda framework.

First, you can append additional removal patterns to the tokens_remove() call.

Second, you can process the character vector returned by stopwords() so that it also includes the versions without the apostrophe.

An illustration:

library("quanteda")
## Package version: 1.5.1

toks <- tokens("I don't know what I dont or cant know.")

# original
tokens_remove(toks, c(stopwords("en")))
## tokens from 1 document.
## text1 :
## [1] "know" "dont" "cant" "know" "."

# manual addition
tokens_remove(toks, c(stopwords("en"), "dont", "cant"))
## tokens from 1 document.
## text1 :
## [1] "know" "know" "."

# automatic addition to stopwords
tokens_remove(toks, c(
  stopwords("en"),
  stringi::stri_replace_all_fixed(stopwords("en"), "'", "")
))
## tokens from 1 document.
## text1 :
## [1] "know" "know" "."