How to remove single and double char tokens using quanteda::tokens_select()

I am trying to remove single- and double-character tokens.

Here is an example:

toks <- tokens(c("This is a sentence. This is a second sentence."), remove_punct = TRUE)

toks <- tokens_select(toks, min_nchar=1L, max_nchar=2L, selection = "remove")

toks

Result:

tokens from 1 document. text1 :

[1] "is" "a" "is" "a"

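I expected to get the tokens that do not match the condition, not the ones that do.

You need to convert the given sentence into tokens and then filter them by length. You can do the following: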

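library(quanteda)

# the sentence from the question
sent <- "This is a sentence. This is a second sentence."

# convert to tokens and flatten to a plain character vector
words <- unlist(tokens(sent, remove_punct = TRUE), use.names = FALSE)

# remove tokens with <= 2 characters
Filter(function(x) nchar(x) > 2, words)

[1] "This"     "sentence" "This"     "second"   "sentence"

library(quanteda)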

toks <- tokens(c("This is a sentence. This is a second sentence."), remove_punct = TRUE)
tokens_select(toks, min_nchar=3L)

It seems the selection argument is ignored when only the length bounds are given.

This gives the result I want:

toks <- tokens_select(toks, min_nchar=3L, max_nchar=79L)
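The output in the question is consistent with the length bounds acting as a "keep" window, whatever selection says. The following is only a base R sketch of that reading, not quanteda's implementation; keep_by_length is a hypothetical helper, and its default bounds of 1 and 79 are assumptions borrowed from the values used in this thread.

```r
# Hypothetical helper (not part of quanteda): keep only the elements whose
# character length lies inside [min_nchar, max_nchar].
keep_by_length <- function(x, min_nchar = 1L, max_nchar = 79L) {
  n <- nchar(x)
  x[n >= min_nchar & n <= max_nchar]
}

# The tokens of the example sentence, with punctuation already removed
words <- c("This", "is", "a", "sentence", "This", "is", "a", "second", "sentence")

# A window of 1 to 2 characters keeps the short tokens, as seen in the question
short_only <- keep_by_length(words, min_nchar = 1L, max_nchar = 2L)

# A lower bound of 3 drops them, which is the result the question asks for
long_only <- keep_by_length(words, min_nchar = 3L)
```

Under this reading, dropping short tokens means raising the lower bound to 3 rather than describing the short tokens and asking for their removal.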