How to extract everything occurring after a character and before the last occurrence of another character in R?
I have three strings, shown below:
"GO:0016559~peroxisome fission,"
"GO:0006122~mitochondrial electron transport, ubiquinol to cytochrome c,"
"GO:0006122~mitochondrial electron transport, ubiquinol to cytochrome c,GO:0006334~nucleosome assembly,"
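How can I extract every substring that occurs after "~" and before a "," (where that comma is either at the end of the string or followed by GO:.........,)?
Desired output: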
"peroxisome fission"
"mitochondrial electron transport, ubiquinol to cytochrome c"
"mitochondrial electron transport, ubiquinol to cytochrome c" "nucleosome assembly"
This should be achieved with a single general statement in R.
I tried this:
strapplyc(str, "[~](.*?)[,]", simplify = c)
(where str is a variable that holds each of the three strings in turn, one at a time, in a loop)
But the output I get is:
"peroxisome fission"
"mitochondrial electron transport"
"mitochondrial electron transport" "nucleosome assembly"
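You can use
(?<=~).*?(?=,(?:GO:\d+~|$))
See the regex demo. Details:
(?<=~) - a position immediately after a ~ char
.*? - any zero or more chars other than line break chars, as few as possible
(?=,(?:GO:\d+~|$)) - a positive lookahead that requires a comma followed by either GO:, one or more digits and ~, or the end of the string, immediately to the right of the current location.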
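To see what the lookahead changes, here is a minimal base R sketch that compares a plain lazy match with the pattern above on one of the strings from the question. The helper name extract_all is hypothetical, and regmatches()/gregexpr() with perl = TRUE are used here only to keep the sketch free of package dependencies. Note that the backslash is doubled inside an R string literal.

```r
# One of the strings from the question; its description contains a comma.
s <- "GO:0006122~mitochondrial electron transport, ubiquinol to cytochrome c,"

# Hypothetical helper: all matches of a PCRE pattern in a single string.
extract_all <- function(pattern, text) {
  regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
}

# Lazy match that ends at the first comma, whatever follows it.
lazy_only <- extract_all("(?<=~).*?(?=,)", s)

# Lazy match that ends only at a comma followed by the next GO term
# or by the end of the string.
with_lookahead <- extract_all("(?<=~).*?(?=,(?:GO:\\d+~|$))", s)

print(lazy_only)
print(with_lookahead)
```

The first call stops at the comma inside the description, which is the behaviour reported in the question; the second keeps the full description.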
See the R demo:
> library(stringr)
> x <- c("GO:0016559~peroxisome fission,","GO:0006122~mitochondrial electron transport, ubiquinol to cytochrome c,","GO:0006122~mitochondrial electron transport, ubiquinol to cytochrome c,GO:0006334~nucleosome assembly,")
> unlist(str_extract_all(x, "(?<=~).*?(?=,(?:GO:\\d+~|$))"))
[1] "peroxisome fission"
[2] "mitochondrial electron transport, ubiquinol to cytochrome c"
[3] "mitochondrial electron transport, ubiquinol to cytochrome c"
[4] "nucleosome assembly"
In base R, you could do:
sub(".*~",'', grep("~",t(read.csv(text = s, header = FALSE)), value = TRUE))
[1] "peroxisome fission" "mitochondrial electron transport"
[3] "mitochondrial electron transport" "nucleosome assembly"
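The read.csv approach above splits on every comma, so a description that itself contains a comma appears cut at that comma in the output shown. As a hedged alternative, the following base R sketch applies the lookaround pattern from the first answer with gregexpr() and regmatches(); the variable names x, pattern and go_terms are chosen here for illustration.

```r
# The three strings from the question.
x <- c(
  "GO:0016559~peroxisome fission,",
  "GO:0006122~mitochondrial electron transport, ubiquinol to cytochrome c,",
  "GO:0006122~mitochondrial electron transport, ubiquinol to cytochrome c,GO:0006334~nucleosome assembly,"
)

# Same pattern as in the stringr answer; perl = TRUE enables lookarounds.
pattern <- "(?<=~).*?(?=,(?:GO:\\d+~|$))"

# gregexpr() locates every match in each string; regmatches() extracts them
# and returns a list with one character vector per input string.
go_terms <- regmatches(x, gregexpr(pattern, x, perl = TRUE))

print(go_terms)
```

Because the result is a list with one element per input string, the matches stay grouped as in the desired output; wrap the call in unlist() if a flat character vector is preferred.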