How to remove different versions of stopwords
I remove stopwords from my text with this approach:
dfm <- tokens(df$text,
              remove_punct = TRUE,
              remove_numbers = TRUE,
              remove_symbols = TRUE) %>%
  tokens_remove(pattern = stopwords(source = "smart")) %>%
  tokens_wordstem()
However, in the results I still find stopwords like this:
dont
Is there any way to remove them without using a custom stopword list?
The stopwords function itself cannot do this. However, you can easily build your own list from the "smart" lexicon and then drop the entries you don't want:
library(dplyr)

my_stopwords <- data.frame(word = stopwords(source = "smart")) %>%
  filter(word != "dont")
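You can then pass the edited list to tokens_remove() in place of the stock one. A minimal sketch, reusing the pipeline from the question (my_stopwords is the data frame built above):

# sketch: the question's pipeline, but with the custom stopword list
tokens(df$text,
       remove_punct = TRUE,
       remove_numbers = TRUE,
       remove_symbols = TRUE) %>%
  tokens_remove(pattern = my_stopwords$word) %>%
  tokens_wordstem()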
You can try to manage this with a couple of packages and functions. You seem comfortable with the tidyverse, so here is one possible solution.
Keep in mind that this is not a perfect approach: if your corpus is very small (short texts), you could manage it by hand and fix the errors manually. If you don't know how many errors there are, or which ones, my solution may help.
library(quanteda) # for your purposes
library(qdap)     # to detect spelling errors
library(tidytext) # lovely package about tidy texts
library(dplyr)    # for the pipes and verbs below
Since you did not share your data, here is some fake data:
df <- data.frame(id = 1:2, text = c("dont panic", "don't panic"), stringsAsFactors = FALSE)
df
  id        text
1  1  dont panic
2  2 don't panic
Now, first of all, we have to fix the errors:
unnested <- df %>% unnest_tokens(not.found, text)        # one row per word
errors <- data.frame(check_spelling(unnested$not.found)) # check for errors; this can take time
full <- unnested %>% left_join(errors)                   # join them (left_join matches on the shared not.found column)
Here is the result:
full
  id not.found row word.no suggestion                                more.suggestions
1  1      dont   1       1      don't donut, don, dot, docent, donate, donuts, dopant
2  1     panic  NA    <NA>       <NA>                                            NULL
3  2     don't  NA    <NA>       <NA>                                            NULL
4  2     panic  NA    <NA>       <NA>                                            NULL
Now it is easy to tidy everything up:
full <- full %>%
  # if there is a correction, replace the wrong word with it
  mutate(word = ifelse(is.na(suggestion), not.found, suggestion)) %>%
  # select useful columns
  select(id, word) %>%
  # group them and rebuild the texts
  group_by(id) %>%
  summarise(text = paste(word, collapse = ' '))
full
# A tibble: 2 x 2
     id text
  <int> <chr>
1     1 don't panic
2     2 don't panic
Now you are ready to go:
tokens(as.character(full$text),
       remove_punct = TRUE,
       remove_numbers = TRUE,
       remove_symbols = TRUE) %>%
  tokens_remove(pattern = stopwords(source = "smart")) %>%
  tokens_wordstem()
tokens from 2 documents.
text1 :
[1] "panic"
text2 :
[1] "panic"
When you say "remove them", I assume you mean removing dont from your tokens, whereas the existing stopword list only removes don't. (Although that is not entirely clear from your question, or from how some of the answers interpreted it.) Two simple solutions exist within the quanteda framework.
First, you can append additional removal patterns to the tokens_remove() call.
Second, you can process the character vector returned by stopwords() to include versions without the apostrophe.
Illustration:
library("quanteda")
## Package version: 1.5.1
toks <- tokens("I don't know what I dont or cant know.")
# original
tokens_remove(toks, c(stopwords("en")))
## tokens from 1 document.
## text1 :
## [1] "know" "dont" "cant" "know" "."
# manual addition
tokens_remove(toks, c(stopwords("en"), "dont", "cant"))
## tokens from 1 document.
## text1 :
## [1] "know" "know" "."
# automatic addition to stopwords
tokens_remove(toks, c(
  stopwords("en"),
  stringi::stri_replace_all_fixed(stopwords("en"), "'", "")
))
## tokens from 1 document.
## text1 :
## [1] "know" "know" "."
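The same trick should carry over to the "smart" source used in the question. A minimal sketch, assuming that list also stores contractions with apostrophes (don't, can't, ...):

# assumption: "smart" stores contractions as "don't" etc., so stripping
# the apostrophes yields the "dont"-style variants to remove as well
smart <- stopwords(source = "smart")
tokens_remove(toks, c(smart, stringi::stri_replace_all_fixed(smart, "'", "")))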