Splitting strings containing multiple delimiters in R
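I have a vector tissue containing strings separated by multiple characters. The component strings of the vector fall roughly into four classes:
1. Strings consisting only of term(s) (e.g. Thymus, Thyroid) separated by ","
2. Strings containing identifier(s) (e.g. ECO:0000313|RefSeq:XP_014046664.1) ending with "}," followed by term(s) separated by ","
3. Strings containing a term followed by identifier(s)
4. Strings containing a term followed by identifier(s) and term(s) separated by ","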
tissue <- c("Head kidney,Thymus,Thyroid,",
"Red blood cell,",
"ECO:0000313|RefSeq:XP_014046664.1},Muscle,",
"ECO:0000313|RefSeq:XP_016683349.1},ECO:0000313|RefSeq:XP_016683354.1},Leaf,",
"ECO:0000313|RefSeq:XP_014023833.1},Head kidney,Muscle,White muscle,",
"Blood,ECO:0000313|RefSeq:XP_017326031.1},",
"Spleen,ECO:0000313|RefSeq:XP_010844217.1},ECO:0000313|RefSeq:XP_010844218.1},",
"Brain,ECO:0000313|RefSeq:XP_014030244.1},Head kidney,Muscle,Spleen,White muscle,")
For strings belonging to class 1, I can split the terms with a simple strsplit() call:
unlist(strsplit("Head kidney,Thymus,Thyroid,", ","))
[1] "Head kidney" "Thymus" "Thyroid"
unlist(strsplit("Red blood cell,", ","))
[1] "Red blood cell"
For strings belonging to class 2, this is what I came up with, and it works fine:
unlist(strsplit(sub('.*\\},', "", "ECO:0000313|RefSeq:XP_014046664.1},Muscle,"), ","))
[1] "Muscle"
unlist(strsplit(sub('.*\\},', "", "ECO:0000313|RefSeq:XP_016683349.1},ECO:0000313|RefSeq:XP_016683354.1},Leaf,"), ","))
[1] "Leaf"
unlist(strsplit(sub('.*\\},', "", "ECO:0000313|RefSeq:XP_014023833.1},Head kidney,Muscle,White muscle,"), ","))
[1] "Head kidney" "Muscle" "White muscle"
For strings belonging to class 3, this works for me:
sub(',ECO:.*', "", "Blood,ECO:0000313|RefSeq:XP_017326031.1},")
[1] "Blood"
sub(',ECO:.*', "", "Spleen,ECO:0000313|RefSeq:XP_010844217.1},ECO:0000313|RefSeq:XP_010844218.1},")
[1] "Spleen"
For class 4, this is what I tried, and it works well:
unlist(strsplit(sub(',ECO:.*},', ",", "Brain,ECO:0000313|RefSeq:XP_014030244.1},Head kidney,Muscle,Spleen,White muscle,"), ","))
[1] "Brain" "Head kidney" "Muscle" "Spleen" "White muscle"
I am looking for a solution, if possible a single regular expression, that handles all of these cases and can be applied directly to the vector.
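We could remove the identifier substrings and then use strsplit:
library(stringr)
# Drop every "ECO:...}" identifier, split on commas, discard empty pieces
lapply(strsplit(str_remove_all(tissue, "ECO:[^\\}]+\\}"), ","),
       function(x) x[nzchar(x)])
Output: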
[[1]]
[1] "Head kidney" "Thymus" "Thyroid"
[[2]]
[1] "Red blood cell"
[[3]]
[1] "Muscle"
[[4]]
[1] "Leaf"
[[5]]
[1] "Head kidney" "Muscle" "White muscle"
[[6]]
[1] "Blood"
[[7]]
[1] "Spleen"
[[8]]
[1] "Brain" "Head kidney" "Muscle" "Spleen" "White muscle"
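The same remove-then-split idea can also be sketched without stringr. This is only an illustration, assuming every identifier starts with "ECO:" and runs up to the first "}"; the helper name split_terms is hypothetical and not part of the answer above.

```r
# Hypothetical helper (not from the original answer): strip identifiers,
# then split what is left on commas.
split_terms <- function(x) {
  # "ECO:[^}]+\\}" matches "ECO:" up to and including the first "}"
  cleaned <- gsub("ECO:[^}]+\\}", "", x)
  # Removed identifiers leave empty pieces behind; drop them
  lapply(strsplit(cleaned, ",", fixed = TRUE), function(p) p[nzchar(p)])
}

res <- split_terms(c("Blood,ECO:0000313|RefSeq:XP_017326031.1},",
                     "ECO:0000313|RefSeq:XP_014023833.1},Head kidney,Muscle,"))
print(res)
```

Because gsub() and strsplit() are vectorised, the helper can be called on the whole tissue vector at once.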
Or using a tidyverse workflow:
library(dplyr)
library(tidyr)
str_remove_all(tissue, "ECO:[^\\}]+\\}") %>%
  trimws(whitespace = ",+") %>%        # strip leading/trailing commas
  str_replace_all(',{2,}', ",") %>%    # collapse runs of commas
  tibble(col1 = .) %>%
  tidyr::separate(col1, into = str_c('V',
    seq(max(str_count(.$col1, ",")) + 1)), sep = ",", fill = "right")
Output:
# A tibble: 8 × 5
  V1             V2          V3           V4     V5
  <chr>          <chr>       <chr>        <chr>  <chr>
1 Head kidney    Thymus      Thyroid      <NA>   <NA>
2 Red blood cell <NA>        <NA>         <NA>   <NA>
3 Muscle         <NA>        <NA>         <NA>   <NA>
4 Leaf           <NA>        <NA>         <NA>   <NA>
5 Head kidney    Muscle      White muscle <NA>   <NA>
6 Blood          <NA>        <NA>         <NA>   <NA>
7 Spleen         <NA>        <NA>         <NA>   <NA>
8 Brain          Head kidney Muscle       Spleen White muscle
Or using only base R:
read.csv(text = gsub(",{2,}", ",", trimws(gsub("ECO:[^\\}]+\\}",
    "", tissue), whitespace = ",+")), header = FALSE, fill = TRUE, sep = ",")
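How about:
library(stringr)
# Lazily capture everything up to each comma, then blank out the identifiers
x <- str_remove(unlist(str_match_all(tissue, '(.*?)(?=,)')), '^ECO.*')
# Keep the distinct non-empty terms across the whole vector
unique(x[x != ""])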
[1] "Head kidney" "Thymus" "Thyroid" "Red blood cell"
[5] "Muscle" "Leaf" "White muscle" "Blood"
[9] "Spleen" "Brain"
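Note that the call above pools the terms from all strings into one vector. If the grouping per string should be kept, one option is to split first and filter afterwards. A minimal sketch, assuming identifiers never contain a comma and always start with "ECO:"; the shortened vector tissue_demo is only for illustration.

```r
# Shortened example input (three of the strings from the question)
tissue_demo <- c("Head kidney,Thymus,Thyroid,",
                 "Blood,ECO:0000313|RefSeq:XP_017326031.1},",
                 "Brain,ECO:0000313|RefSeq:XP_014030244.1},Head kidney,Muscle,")

# Split every string on commas; each identifier then forms a whole piece
parts <- strsplit(tissue_demo, ",", fixed = TRUE)

# Keep only non-empty pieces that are not identifiers
terms <- lapply(parts, function(p) p[nzchar(p) & !startsWith(p, "ECO:")])
print(terms)
```

This avoids the regular expression entirely, at the cost of relying on the "ECO:" prefix to recognise identifiers.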