How to remove single and double char tokens using quanteda::tokens_select()
I am trying to remove one- and two-character tokens.
Here is an example:
toks <- tokens(c("This is a sentence. This is a second sentence."), remove_punct = TRUE)
toks <- tokens_select(toks, min_nchar=1L, max_nchar=2L, selection = "remove")
toks
Result:
tokens from 1 document.
text1 :
[1] "is" "a" "is" "a"
I expected to get the tokens that do not match the condition, not the ones that do.
You need to convert the given sentence to tokens. You can do the following:
library(quanteda)
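# example sentence (same text as in the question)
sent <- "This is a sentence. This is a second sentence."
# convert to tokens and flatten to a plain character vector
tokens <- unlist(tokens(sent, remove_punct = TRUE), use.names = FALSE)
# remove tokens with <= 2 characters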
Filter(function(x) nchar(x) > 2, tokens)
[1] "This" "sentence" "This" "second" "sentence"
library(quanteda)
toks <- tokens(c("This is a sentence. This is a second sentence."), remove_punct = TRUE)
tokens_select(toks, min_nchar=3L)
It seems the selection argument is ignored when only min_nchar and max_nchar are given.
This gives the result I want:
toks <- tokens_select(toks, min_nchar=3L, max_nchar=79L)
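Another option is to write the length condition as a regular-expression pattern, so the removal is explicit and does not depend on how min_nchar and max_nchar interact with selection. This is only a sketch: it assumes, as the output above suggests, that the length limits act as a keep range regardless of selection. The name toks_long is just illustrative.

```r
library(quanteda)

toks <- tokens("This is a sentence. This is a second sentence.", remove_punct = TRUE)

# Remove every token that is exactly one or two characters long.
# With valuetype = "regex", the anchored pattern is tested against each token.
toks_long <- tokens_remove(toks, pattern = "^.{1,2}$", valuetype = "regex")

# Flatten to a plain character vector for inspection
unlist(as.list(toks_long), use.names = FALSE)
```

tokens_remove() is shorthand for tokens_select() with selection = "remove", so here the pattern decides what is dropped.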