How to keep only information inside a complex string in R?

Question

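I want to keep one run of characters from inside a complex string. I think I can use a regular expression to keep what I need. Basically, I only want to keep the information between the \" and \" in Function=\"SMAD5\". I also want to keep the empty strings: Function=\"\"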

df=structure(1:6, .Label = c("ID=Gfo_R000001;Source=ENST00000513418;Function=\"SMAD5\";", 
"ID=Gfo_R000002;Source=ENSTGUT00000017468;Function=\"CENPA\";", 
"ID=Gfo_R000003;Source=ENSGALT00000028134;Function=\"C1QL4\";", 
"ID=Gfo_R000004;Source=ENSTGUT00000015300;Function=\"\";", "ID=Gfo_R000005;Source=ENSTGUT00000019268;Function=\"\";", 
"ID=Gfo_R000006;Source=ENSTGUT00000019035;Function=\"\";"), class = "factor")

The result should look like this:

"SMAD5"
"CENPA"
"C1QL4"
NA
NA
NA

What I have managed so far:

gsub('.*Function=\"',"",df)

[1] "SMAD5\";" "CENPA\";" "C1QL4\";" "\";"      "\";"      "\";"

But I'm stuck with a bunch of \"; left over. How can I remove them in one line?

I tried this:

gsub('.*Function=\"' & '.\"*',"",test)

But it gives me this error:

Error in ".*Function=\"" & ".\"*" : 
  operations are possible only for numeric, logical or complex types

Answer 1

You can use

gsub(".*Function=\"([^\"]*).*", "\\1", df)

See the regex demo.

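Details:

.* - any 0+ characters, as many as possible, up to the last occurrence of the subpatterns that follow
Function=\" - the literal substring Function="
([^\"]*) - capturing group 1, matching 0+ characters other than "
.* - the rest of the string.

The \1 in the replacement (written "\\1" in an R string literal) is the backreference that restores the contents of group 1 in the result.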
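Note that the backreference approach returns "" for the empty Function fields, while the question asks for NA. A minimal sketch of one way to close that gap in base R follows; the helper name extract_function and the shortened sample vector x are made up for illustration, and the sketch assumes every element contains a Function="..." field.

```r
# Hypothetical helper (not from the original answer): pull out the text
# between the quotes after Function= and turn empty results into NA.
extract_function <- function(x) {
  x <- as.character(x)  # also accepts a factor like df in the question
  res <- sub('.*Function="([^"]*).*', "\\1", x)
  res[res == ""] <- NA_character_
  res
}

x <- c(
  'ID=Gfo_R000001;Source=ENST00000513418;Function="SMAD5";',
  'ID=Gfo_R000004;Source=ENSTGUT00000015300;Function="";'
)
extract_function(x)
# [1] "SMAD5" NA
```

As for the error in the question: & is R's element-wise logical AND, not a string-concatenation operator, so it cannot join two patterns. Two patterns can instead be combined inside a single regex with |, for example gsub('.*Function="|";$', "", x).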

Answer 2

With stringr we can also use a capture group:

library(stringr)
matches <- str_match(df, ".*\"(.*)\".*")[,2]
ifelse(matches=='', NA, matches)
# [1] "SMAD5" "CENPA" "C1QL4" NA      NA      NA
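The pattern above captures whatever sits between the quotes, which works here because Function is the only quoted field. If stringr is not available, or if other quoted fields might appear, a similar capture-group extraction can be sketched in base R with regexec and regmatches, anchored on Function=. The helper name function_field and the sample vector are hypothetical; returning NA for strings with no Function field is a design choice of this sketch, not something the question specifies.

```r
# Hypothetical helper: base-R analogue of str_match(...)[, 2],
# anchored on the Function= key.
function_field <- function(x) {
  x <- as.character(x)
  m <- regmatches(x, regexec('Function="([^"]*)"', x))
  # Each element of m is c(full match, group 1), or character(0) if no match
  vals <- vapply(
    m,
    function(g) if (length(g) >= 2) g[2] else NA_character_,
    character(1)
  )
  vals[vals %in% ""] <- NA_character_  # empty Function="" becomes NA
  vals
}

y <- c(
  'ID=Gfo_R000002;Source=ENSTGUT00000017468;Function="CENPA";',
  'ID=Gfo_R000005;Source=ENSTGUT00000019268;Function="";',
  'ID=Gfo_R000099;Source=none;'
)
function_field(y)
# [1] "CENPA" NA      NA
```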

Answer 3

You can use rebus to build a more readable regular expression:

rx <- 'Function="' %R% 
  capture(zero_or_more(negated_char_class('"')))

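Then the matching works as Wiktor and sandipan described:

library(rebus)
library(stringr)
library(stringi)

rx <- 'Function="' %R% capture(zero_or_more(negated_char_class('"')))
str_match(df, rx)               # column 2 holds the captured group
stri_match_first_regex(df, rx)  # stringi equivalent

# base R: replace the whole string with group 1 via rebus's REF1 backreference
gsub(any_char(0, Inf) %R% rx %R% any_char(0, Inf), REF1, df)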
