Remove a list of whole words that may contain special chars from a character vector without matching parts of words
I have a list of words in R, as shown below:
myList <- c("at","ax","CL","OZ","Gm","Kg","C100","-1.00")
I want to remove the words in the list above from the following text:
myText <- "This is at Sample ax Text, which CL is OZ better and cleaned Gm, where C100 is not equal to -1.00. This is messy text Kg."
After removing the unwanted myList words, myText should look like this:
This is Sample Text, which is better and cleaned, where is not equal to. This is messy text.
I am using:
stringr::str_replace_all(myText,"[^a-zA-Z\\s]", " ")
But this is not helping me. What should I do?
gsub(paste0(myList, collapse = "|"), "", myText)
gives:
[1] "This is Sample Text, which is better and cleaned , where is not equal to . This is messy text ."
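Beyond the stray spaces before punctuation, the plain alternation has two more pitfalls that the sample text happens not to trigger. A minimal sketch on a made-up string (the `words` subset and the input are illustrative assumptions, not from the question): without boundary checks, "at" is also removed from inside longer words, and the unescaped "." in "-1.00" matches any character.

```r
# Naive alternation: no escaping and no whole-word checks.
words <- c("at", "ax", "-1.00")
naive <- paste0(words, collapse = "|")   # "at|ax|-1.00"

# "at" is stripped out of "cat", "sat" and "later";
# "." matches "x", so "-1x00" is removed too.
result <- gsub(naive, "", "cat sat later -1x00")
print(result)
## => [1] "c s ler "
```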
You can use a PCRE regex with the base R gsub function (it also works with the ICU regex flavor in str_replace_all):
\s*(?<!\w)(?:at|ax|CL|OZ|Gm|Kg|C100|-1\.00)(?!\w)
See the regex demo.
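Details
\s* - zero or more whitespace characters
(?<!\w) - a negative lookbehind that ensures there is no word char immediately before the current location
(?:at|ax|CL|OZ|Gm|Kg|C100|-1\.00) - a non-capturing group containing the escaped items of the character vector, i.e. the words you need to remove
(?!\w) - a negative lookahead that ensures there is no word char immediately after the current location.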
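To make the effect of the two lookarounds concrete, here is a small sketch on a shortened pattern and a made-up input (both are illustrative assumptions): "at" and "ax" are removed only when they stand alone, while "cat", "bat" and "axe" are left untouched.

```r
# Same structure as the pattern above, reduced to two alternatives.
pat <- "\\s*(?<!\\w)(?:at|ax)(?!\\w)"

# Lookarounds require perl = TRUE in base R.
result <- gsub(pat, "", "cat at bat, ax axe", perl = TRUE)
print(result)
## => [1] "cat bat, axe"
```

The leading \s* is what swallows the space in front of each removed word, so no double spaces are left behind.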
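NOTE: We cannot use the \b word boundary here because the items in the myList character vector may start/end with non-word characters, while the meaning of \b is context-dependent.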
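A short sketch of what "context-dependent" means for an item such as "-1.00" (the sample strings are made up for illustration). Before a "-", \b only matches when the preceding char is a word char, which is the opposite of what is wanted here.

```r
text <- "equal to -1.00. Done"

# \b before "-" needs a word char on its left; a space is not one,
# so nothing matches and the text comes back unchanged.
with_b <- gsub("\\b-1\\.00\\b", "", text, perl = TRUE)
print(with_b)
## => [1] "equal to -1.00. Done"

# The same \b pattern does match when "-1.00" is glued to a word char.
glued <- gsub("\\b-1\\.00\\b", "", "x-1.00", perl = TRUE)
print(glued)
## => [1] "x"

# The lookaround version only checks that no word char is adjacent.
with_lookarounds <- gsub("\\s*(?<!\\w)-1\\.00(?!\\w)", "", text, perl = TRUE)
print(with_lookarounds)
## => [1] "equal to. Done"
```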
See an R demo online:
myList <- c("at","ax","CL","OZ","Gm","Kg","C100","-1.00")
myText <- "This is at Sample ax Text, which CL is OZ better and cleaned Gm, where C100 is not equal to -1.00. This is messy text Kg."
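# Prefix every regex special char with a backslash
escape_for_pcre <- function(s) { return(gsub("([{[()|?$^*+.\\])", "\\\\\\1", s)) }
# Build the pattern: optional whitespace, then a whole-word match of any escaped item
pat <- paste0("\\s*(?<!\\w)(?:", paste(sapply(myList, function(t) escape_for_pcre(t)), collapse = "|"), ")(?!\\w)")
cat(pat, "\n")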
gsub(pat, "", myText, perl=TRUE)
## => [1] "This is Sample Text, which is better and cleaned, where is not equal to. This is messy text."
Details
escape_for_pcre <- function(s) { return(gsub("([{[()|?$^*+.\\])", "\\\\\\1", s)) } - escapes all special chars that need escaping in a PCRE pattern
paste(sapply(myList, function(t) escape_for_pcre(t)), collapse = "|") - creates a |-separated list of alternatives from the search term vector.
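If you need this in more than one place, the steps above can be wrapped in a small helper. This is only a sketch: the name remove_words and the sample input are hypothetical, and the escaping function here is a variant that runs with perl = TRUE and writes the backslash inside the bracket expression as an escaped backslash, so it does not depend on how the default engine treats a backslash inside brackets.

```r
# Variant of escape_for_pcre (assumption: PCRE syntax via perl = TRUE).
# gsub is vectorised, so this works on a whole character vector at once.
escape_for_pcre <- function(s) {
  gsub("([{\\[()|?$^*+.\\\\])", "\\\\\\1", s, perl = TRUE)
}

# Hypothetical helper: removes each item of `words` from `text`
# only when it is not adjacent to a word char.
remove_words <- function(text, words) {
  alternatives <- paste(escape_for_pcre(words), collapse = "|")
  pat <- paste0("\\s*(?<!\\w)(?:", alternatives, ")(?!\\w)")
  gsub(pat, "", text, perl = TRUE)
}

cleaned <- remove_words("Weight is 5 Kg, offset -1.00. Kgs stay.", c("Kg", "-1.00"))
print(cleaned)
## => [1] "Weight is 5, offset. Kgs stay."
```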