Remove custom stopwords and phrases using quanteda
I have my own list of stopwords, and I want to use it to remove specific phrases from a text:
#dummy text
df2 <- c("hi my name is Ann and code code all the time! However not after that I would like")
mystopwords <- c("hi", "code code", "not after that")
This is what I tried:
myDfm <- df2 %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
  tokens_remove(pattern = c(stopwords(source = "smart"), mystopwords)) %>%
  tokens_wordstem() %>%
  tokens_ngrams(n = c(1, 3)) %>%
  dfm()
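But when I check the frequencies of the bigrams or trigrams, the phrases have not been removed, only stemmed.
Is something wrong with the syntax?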
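You can do this by wrapping your list of stop phrases in phrase(). Without it, a pattern such as "code code" is compared against individual tokens, and no single token contains a space, so it never matches.
It works like this: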
library(quanteda)
df2 <- c("hi my name is Ann and code code all the time! However not after that I would like")
mystopwords <- c("hi", "code code", "not after that")
df2 %>% tokens %>%
tokens_remove(pattern = phrase(mystopwords), valuetype = 'fixed')
## tokens from 1 document.
## text1 :
## [1] "my" "name" "is" "Ann" "and" "all" "the" "time" "!" "However" "I" "would"
## [13] "like"
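Applied to the full pipeline from the question, the same idea might look like the sketch below. It is only one possible arrangement: removing the stop phrases before the single-word stopwords, using `n = 1:3` so that bigrams are included, and writing the steps without pipes are all assumptions made here, not part of the original code.

```r
library(quanteda)

df2 <- c("hi my name is Ann and code code all the time! However not after that I would like")
mystopwords <- c("hi", "code code", "not after that")

# Tokenize with the same cleaning options as in the question
toks <- tokens(df2, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE)

# Remove the custom stop phrases first, while the original word order is intact.
# phrase() makes "code code" match a sequence of two tokens.
toks <- tokens_remove(toks, pattern = phrase(mystopwords), valuetype = "fixed")

# Then remove the single-word SMART stopwords
toks <- tokens_remove(toks, pattern = stopwords(source = "smart"))

# Stem, build 1- to 3-grams (assumption: bigrams are wanted too), and create the dfm
toks <- tokens_wordstem(toks)
toks <- tokens_ngrams(toks, n = 1:3)
myDfm <- dfm(toks)

# Inspect the remaining features
featnames(myDfm)
```

Removing the phrases before the other stopwords matters in general: once single words are dropped, words that were not neighbours become adjacent, and a stop phrase may no longer line up with the token sequence.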
You can find more details on working with multi-word expressions in quanteda here:
https://quanteda.io/articles/pkgdown/examples/phrase.html