Remove a list of whole words that may contain special chars from a character vector without matching parts of words

I have a list of words in R, as shown below:

 myList <- c("at","ax","CL","OZ","Gm","Kg","C100","-1.00")

I want to remove the words in the list above from the following text:

 myText <- "This is at Sample ax Text, which CL is OZ better and cleaned Gm, where C100 is not equal to -1.00. This is messy text Kg."

After removing the unwanted myList words, myText should look like this:

  This is Sample Text, which is better and cleaned, where is not equal to. This is messy text.

I am using:

  stringr::str_replace_all(myText,"[^a-zA-Z\\s]", " ")

But this does not help me. What should I do?

gsub(paste0(myList, collapse = "|"), "", myText)

gives:

[1] "This is  Sample  Text, which  is  better and cleaned , where  is not equal to . This is messy text ."

You can use a PCRE regex with the base R gsub function (it will also work with the ICU regex flavor in str_replace_all):

\s*(?<!\w)(?:at|ax|CL|OZ|Gm|Kg|C100|-1\.00)(?!\w)

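See the regex demo.

Details

  • \s* - zero or more whitespace characters
  • (?<!\w) - a negative lookbehind that ensures there is no word char immediately before the current position
  • (?:at|ax|CL|OZ|Gm|Kg|C100|-1\.00) - a non-capturing group containing the escaped items from the character vector, i.e. the words you need to remove
  • (?!\w) - a negative lookahead that ensures there is no word char immediately after the current position.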
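As a small illustration of the two lookarounds, the sketch below compares a plain alternation with the lookaround version on a made-up string (the sample text `x` and the two-word alternation are assumptions chosen for the example, not taken from the question). Note that every regex backslash is doubled inside an R string literal.

```r
# Hypothetical sample text: "cat" and "tax" contain "at" and "ax" as substrings
x <- "cat tax at ax"

# Plain alternation: also removes "at"/"ax" inside longer words
naive <- gsub("at|ax", "", x)

# Lookaround version: only removes "at"/"ax" when they stand alone
pat <- "\\s*(?<!\\w)(?:at|ax)(?!\\w)"
whole <- gsub(pat, "", x, perl = TRUE)

print(naive)
print(whole)
```

Here `naive` loses the inner letters of "cat" and "tax", while `whole` leaves both words intact and drops only the standalone items together with the whitespace before them.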

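Note: we cannot use the \b word boundary here because the items in the myList character vector may start/end with non-word chars, and the meaning of \b is context-dependent.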
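To make the note concrete, here is a minimal sketch of how \b behaves for an item that starts with a non-word char such as "-1.00". The two test strings are assumptions made up for this illustration.

```r
txt1 <- "not equal to -1.00."   # "-" is preceded by a space
txt2 <- "x-1.00"                # "-" is preceded by a word char

# With \b: there is no boundary between a space and "-", so nothing is removed
# in txt1, yet there IS a boundary between "x" and "-", so txt2 is changed.
b1 <- gsub("\\b-1\\.00\\b", "", txt1, perl = TRUE)
b2 <- gsub("\\b-1\\.00\\b", "", txt2, perl = TRUE)

# With lookarounds: the standalone item is removed, the glued one is kept.
l1 <- gsub("(?<!\\w)-1\\.00(?!\\w)", "", txt1, perl = TRUE)
l2 <- gsub("(?<!\\w)-1\\.00(?!\\w)", "", txt2, perl = TRUE)

print(c(b1, b2, l1, l2))
```

In other words, \b before "-" requires a word char on the left, which is the opposite of what is wanted here, whereas (?<!\w) requires that there is none.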

See an R demo online:

myList <- c("at","ax","CL","OZ","Gm","Kg","C100","-1.00")
myText <- "This is at Sample ax Text, which CL is OZ better and cleaned Gm, where C100 is not equal to -1.00. This is messy text Kg."
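# Prefix every regex metacharacter with a backslash (backslashes are doubled in R strings)
escape_for_pcre <- function(s) { return(gsub("([{[()|?$^*+.\\\\])", "\\\\\\1", s)) }
# Build \s*(?<!\w)(?:item1|item2|...)(?!\w) from the escaped items
pat <- paste0("\\s*(?<!\\w)(?:", paste(sapply(myList, function(t) escape_for_pcre(t)), collapse = "|"), ")(?!\\w)")
cat(pat, "\n")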
gsub(pat, "", myText, perl=TRUE)
## => [1] "This is Sample Text, which is better and cleaned, where is not equal to. This is messy text."

Details

  • escape_for_pcre <- function(s) { return(gsub("([{[()|?$^*+.\\\\])", "\\\\\\1", s)) } - escapes all special chars that need escaping in a PCRE pattern
  • paste(sapply(myList, function(t) escape_for_pcre(t)), collapse = "|") - creates a |-separated list of alternatives from the search term vector.
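If this is needed in more than one place, the two steps can be wrapped into a small helper. The sketch below is one possible way to do that; the name `remove_words` is hypothetical and not part of base R or stringr. It relies on gsub being vectorized over its input, so the escaping is applied to the whole vector without sapply.

```r
# Hypothetical helper: remove whole "words" (which may contain special chars)
# from text, together with any whitespace directly before them.
remove_words <- function(text, words) {
  # Escape regex metacharacters in every item
  escaped <- gsub("([{[()|?$^*+.\\\\])", "\\\\\\1", words)
  # \s*(?<!\w)(?:w1|w2|...)(?!\w)
  pat <- paste0("\\s*(?<!\\w)(?:", paste(escaped, collapse = "|"), ")(?!\\w)")
  gsub(pat, "", text, perl = TRUE)
}

myList <- c("at","ax","CL","OZ","Gm","Kg","C100","-1.00")
myText <- "This is at Sample ax Text, which CL is OZ better and cleaned Gm, where C100 is not equal to -1.00. This is messy text Kg."
result <- remove_words(myText, myList)
print(result)
```

This should give the same cleaned sentence as the demo above, since it builds the same pattern from the same inputs.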