Replace the whole word that starts with a pattern using gsub in R
I have a problem that should be easy to solve.
I want to replace every whole word in a string that starts with a given pattern.
> test <- "i really wasn aware and i wasnt aware at all. but i wasn't aware. just wasn't."
## this is what i want
> output
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't."
The best I have come up with so far is this:
# this is what I get, but it's not correct
> gsub("\\<wasn*.\\>", "wasn't", test)
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't't aware. Just wasn't't."
I have really run out of ideas. I would also be happy with this:
# second desired output without the . at the end
> output
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't"
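Edit: it seems my question was a bit too specific, so I am adding more test cases. Basically, I don't know which characters will follow "wasn", and I want to convert all of these words to wasn't.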
> test <- "i really wasn aware and i wasnt aware at all. but i wasn't aware. just wasn't. this wasn45'e meant to be. it wasn@'re simple"
> test
[1] "i really wasn aware and i wasnt aware at all. but i wasn't aware. just wasn't. this wasn45'e meant to be. it wasn@'re simple"
#desired output
> output
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't. this wasn't meant to be. it wasn't simple"
You can use the negative lookahead that Perl-style regular expressions provide. Pattern: wasn(?!')t*
gsub("wasn(?!')t*", "wasn't", test, perl=TRUE)
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't."
Or you can do this:
gsub("wasn'*t*","wasn't",test)
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't."
For the second desired output:
gsub("wasn'*t*[.]?","wasn't",test)
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't"
After the edit to the question:
gsub("wasn[^. ]*","wasn't",test)
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't. this wasn't meant to be. it wasn't simple"
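To compare the three patterns above side by side, here is a minimal sketch that runs them on the strings from the question. The variable names (test_ext, by_lookahead, by_optional, by_negated) are only illustrative and are not part of the original answer.

```r
# Inputs copied from the question; test_ext is the extended test string.
test <- "i really wasn aware and i wasnt aware at all. but i wasn't aware. just wasn't."
test_ext <- paste(test, "this wasn45'e meant to be. it wasn@'re simple")

# Negative lookahead: match "wasn" only when no apostrophe follows it.
by_lookahead <- gsub("wasn(?!')t*", "wasn't", test, perl = TRUE)

# Optional apostrophe and t: also matches "wasn't" and rewrites it unchanged.
by_optional <- gsub("wasn'*t*", "wasn't", test)

# Everything after "wasn" up to the next space or period.
by_negated <- gsub("wasn[^. ]*", "wasn't", test_ext)

print(identical(by_lookahead, by_optional))
print(by_negated)
```

The first two give the same result on the original string; only the third copes with arbitrary characters after "wasn", because it does not assume what those characters are.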
I suggest a solution like this:
test <- c("i really wasn aware and i wasnt aware at all. but i wasn't aware. just wasn't. this wasn45'e meant to be. it wasn@'re simple", "Wasn&^$tt that nice?", "You say wasnmmmt?", "No, he wasn&#t#@$.", "She wasn%#@t##, I know.")
gsub("\\b(wasn)\\S*\\b(?:\\S*(\\p{P})\\B)?", "\\1't\\2", test, ignore.case=TRUE, perl=TRUE)
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't. this wasn't meant to be. it wasn't simple"
[2] "Wasn't that nice?"
[3] "You say wasn't?"
[4] "No, he wasn't."
[5] "She wasn't, I know."
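See the online R demo.
This solution handles the cases where wasn* appears at the start of the string or is capitalized, and it does not replace trailing punctuation.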
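Pattern details
- \b - a word boundary
- (wasn) - capturing group 1 (referenced later as \1 in the replacement pattern): a wasn substring (case insensitive because of ignore.case=TRUE)
- \S*\b - any 0+ non-whitespace characters followed by a word boundary
- (?:\S*(\p{P})\B)? - an optional non-capturing group that matches 1 or 0 occurrences of:
  - \S* - 0+ non-whitespace characters
  - (\p{P}) - capturing group 2 (referenced later as \2 in the replacement pattern): any 1 punctuation character (not a symbol! \p{P} is not the same as [:punct:]!)
  - \B - a non-word boundary: the punctuation character must not be followed by a letter, digit, or _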
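The difference between \p{P} and [:punct:] mentioned above can be checked directly. This is a small sketch, assuming R's PCRE build has Unicode property support (the usual case); the names chars, is_p and is_punct are made up for the illustration.

```r
chars <- c(",", ".", "?", "$", "^", "+")

# \p{P} covers Unicode punctuation only; $, ^ and + are classed as symbols.
is_p <- grepl("\\p{P}", chars, perl = TRUE)

# The POSIX class [:punct:] also includes those ASCII symbols.
is_punct <- grepl("[[:punct:]]", chars, perl = TRUE)

print(data.frame(char = chars, p_P = is_p, posix_punct = is_punct))
```

This is why group 2 in the pattern above can capture a comma or a question mark but never a $ or a ^.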
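For messier strings (such as She wasn%#@t##,$#^ I know.), where the punctuation you want to keep can sit inside other punctuation, you can use a custom bracket expression to restrict which punctuation characters to stop at, and add \S* at the end: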
gsub("\\b(wasn)\\S*\\b(?:\\S*([?!.,:;])\\S*)?", "\\1't\\2", test, ignore.case=TRUE, perl=TRUE)
See the regex demo.
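As a quick check of the two variants on the messy example string, here is a sketch; the names messy, first and second are illustrative only.

```r
messy <- "She wasn%#@t##,$#^ I know."

# First pattern from this answer: group 2 takes the last \p{P} character
# that is followed by a non-word boundary, which need not be the comma.
first <- gsub("\\b(wasn)\\S*\\b(?:\\S*(\\p{P})\\B)?", "\\1't\\2",
              messy, ignore.case = TRUE, perl = TRUE)

# Restricted variant: stop at one of ? ! . , : ; and swallow the rest.
second <- gsub("\\b(wasn)\\S*\\b(?:\\S*([?!.,:;])\\S*)?", "\\1't\\2",
               messy, ignore.case = TRUE, perl = TRUE)

print(first)
print(second)  # "She wasn't, I know."
```

Printing both makes it easy to see what the first pattern leaves behind on this input and why the bracket expression is the safer choice when the noise itself contains punctuation.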
Why not keep it simple and replace any word that starts with wasn with wasn't?
test2 <- paste0(
  "i really wasn aware and i wasnt aware at all. but i wasn't aware. just ",
  "wasn't. this wasn45'e meant to be. it wasn@'re simple"
)
gsub("wasn[^ ]*", "wasn't", test2)
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't this wasn't meant to be. it wasn't simple"
If you also need to handle uppercase letters, you can add ignore.case = TRUE to gsub().
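One caveat with ignore.case = TRUE is that a fixed replacement lowercases a capitalized Wasn. A possible sketch that keeps the original capitalization is to borrow the capture group idea from the other answer; the sample vector x and the name fixed are invented for the illustration.

```r
x <- c("Wasnt that nice", "it wasn45'e meant to be", "he WASN@t there")

# \\b anchors the match at the start of a word; group 1 keeps the
# original capitalization of "wasn" so that it can be reused via \\1.
fixed <- gsub("\\b(wasn)[^ ]*", "\\1't", x, ignore.case = TRUE, perl = TRUE)

print(fixed)
# [1] "Wasn't that nice"      "it wasn't meant to be" "he WASN't there"
```

As with the simple pattern above, anything attached to the word, including trailing punctuation, is swallowed by [^ ]*.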