Remove a list of whole words that may contain special chars from a character vector without matching parts of words

Question

I have a list of words in R as shown below:

 myList <- c("at","ax","CL","OZ","Gm","Kg","C100","-1.00")

I want to remove the words in the above list from the text below:

 myText <- "This is at Sample ax Text, which CL is OZ better and cleaned Gm, where C100 is not equal to -1.00. This is messy text Kg."

After removing the unwanted myList words, myText should look like this:

  This is Sample Text, which is better and cleaned, where is not equal to. This is messy text.

I am using:

  stringr::str_replace_all(myText, "[^a-zA-Z\\s]", " ")

But this does not help. What should I do?

Answer 1

gsub(paste0(myList, collapse = "|"), "", myText)

gives:

[1] "This is  Sample  Text, which  is  better and cleaned , where  is not equal to . This is messy text ."
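
One caveat may be worth checking before relying on this pattern: the alternation is unanchored and the items are not escaped, so it can also match inside longer words, and the . in -1.00 matches any character. A minimal sketch, using made-up sample strings that are not from the question:

```r
# Hypothetical inputs chosen only to illustrate the caveat
myList <- c("at", "ax", "-1.00")
pat <- paste0(myList, collapse = "|")

# "at" and "ax" are also removed from inside "that" and "tax"
partial <- gsub(pat, "", "that tax")
print(partial)

# The unescaped "." lets the "-1.00" item match "-1x00" as well
unescaped <- gsub(pat, "", "value -1x00")
print(unescaped)
```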

Answer 2

You can use a PCRE regex with the base R gsub function (it will also work with the ICU regex in str_replace_all):

\s*(?<!\w)(?:at|ax|CL|OZ|Gm|Kg|C100|-1\.00)(?!\w)

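See the regex demo.

Details

\s* - 0 or more whitespace chars
(?<!\w) - a negative lookbehind that makes sure there is no word char immediately before the current location
(?:at|ax|CL|OZ|Gm|Kg|C100|-1\.00) - a non-capturing group containing the escaped items from the character vector, i.e. the words you need to remove
(?!\w) - a negative lookahead that makes sure there is no word char immediately after the current location.

NOTE: We cannot use the \b word boundary here because the items in the myList character vector may start/end with non-word characters, and the meaning of \b is context-dependent.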
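
The difference between \b and the lookarounds can be checked on a small input. The sketch below uses a shortened sample string and only two of the items (both are assumptions made for illustration, not the question's full data):

```r
txt <- "not equal to -1.00. that at"

# \b requires a word char on exactly one side. Between " " and "-" there is
# none, so the item that starts with "-" is never matched.
with_b <- gsub("\\s*\\b(?:at|-1\\.00)\\b", "", txt, perl = TRUE)

# The lookarounds only require that no word char is adjacent to the match,
# so "-1.00" is removed while the "at" inside "that" is left alone.
with_look <- gsub("\\s*(?<!\\w)(?:at|-1\\.00)(?!\\w)", "", txt, perl = TRUE)

print(with_b)
print(with_look)
```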

See an R demo online:

myList <- c("at","ax","CL","OZ","Gm","Kg","C100","-1.00")
myText <- "This is at Sample ax Text, which CL is OZ better and cleaned Gm, where C100 is not equal to -1.00. This is messy text Kg."
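# Prefix every PCRE metacharacter in s with a backslash
escape_for_pcre <- function(s) { return(gsub("([.\\\\|()\\[\\]{}^$*+?])", "\\\\\\1", s, perl = TRUE)) }
# Build \s*(?<!\w)(?:item1|item2|...)(?!\w) from the escaped items
pat <- paste0("\\s*(?<!\\w)(?:", paste(sapply(myList, function(t) escape_for_pcre(t)), collapse = "|"), ")(?!\\w)")
cat(pat, "\n")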
gsub(pat, "", myText, perl=TRUE)
## => [1] "This is Sample Text, which is better and cleaned, where is not equal to. This is messy text."

Details

escape_for_pcre - escapes all special chars that need escaping in a PCRE pattern
paste(sapply(myList, function(t) escape_for_pcre(t)), collapse = "|") - creates a |-separated list of alternatives from the search term vector.
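
The two steps above could be wrapped into one reusable function. The sketch below is only an illustration: the name remove_words, the sample text, and the longest-first ordering of the alternatives are assumptions added here and are not part of the answer.

```r
# Hypothetical helper: remove whole "words" (which may contain special chars)
remove_words <- function(text, words) {
  # Escape PCRE metacharacters in each item
  escaped <- gsub("([.\\\\|()\\[\\]{}^$*+?])", "\\\\\\1", words, perl = TRUE)
  # Assumption: try longer items first so "C++" is not cut short by "C"
  escaped <- escaped[order(nchar(escaped), decreasing = TRUE)]
  pat <- paste0("\\s*(?<!\\w)(?:", paste(escaped, collapse = "|"), ")(?!\\w)")
  gsub(pat, "", text, perl = TRUE)
}

# Invented sample input; "Cost" keeps its "C" because a word char follows it
res <- remove_words("Cost is C++ or C (approx.) today", c("C", "C++", "(approx.)"))
print(res)
```

As the answer notes, the same pattern string should also work with the ICU regex in stringr::str_replace_all, provided stringr is installed.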

regex

r

gsub

stringr