Splitting strings containing multiple delimiters in R

I have a vector, tissue, containing strings delimited by multiple characters. The strings in the vector fall broadly into four classes:

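  1. Strings consisting only of term(s) (e.g. Thymus, Thyroid), separated by ,
  2. Strings containing identifier(s) (e.g. ECO:0000313|RefSeq:XP_014046664.1) ending with }, followed by term(s), separated by ,
  3. Strings containing a term followed by identifier(s)
  4. Strings containing a term followed by an identifier and then term(s), separated by ,
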
    tissue <- c("Head kidney,Thymus,Thyroid,", 
                "Red blood cell,", 
                "ECO:0000313|RefSeq:XP_014046664.1},Muscle,",
                "ECO:0000313|RefSeq:XP_016683349.1},ECO:0000313|RefSeq:XP_016683354.1},Leaf,", 
                "ECO:0000313|RefSeq:XP_014023833.1},Head kidney,Muscle,White muscle,",
                "Blood,ECO:0000313|RefSeq:XP_017326031.1},",
                "Spleen,ECO:0000313|RefSeq:XP_010844217.1},ECO:0000313|RefSeq:XP_010844218.1},",
                "Brain,ECO:0000313|RefSeq:XP_014030244.1},Head kidney,Muscle,Spleen,White muscle,")
    

For strings in class 1, I can split the terms with a simple strsplit() call:

unlist(strsplit("Head kidney,Thymus,Thyroid,", ","))
[1] "Head kidney" "Thymus"      "Thyroid" 

unlist(strsplit("Red blood cell,", ","))
[1] "Red blood cell"

For strings in class 2, this is what I came up with, and it works:

unlist(strsplit(sub('.*\\},', "", "ECO:0000313|RefSeq:XP_014046664.1},Muscle,"), ","))
[1] "Muscle"

unlist(strsplit(sub('.*\\},', "", "ECO:0000313|RefSeq:XP_016683349.1},ECO:0000313|RefSeq:XP_016683354.1},Leaf,"), ","))
[1] "Leaf"

unlist(strsplit(sub('.*\\},', "", "ECO:0000313|RefSeq:XP_014023833.1},Head kidney,Muscle,White muscle,"), ","))
[1] "Head kidney"  "Muscle"       "White muscle"

For strings in class 3, this works for me:

sub(',ECO:.*', "", "Blood,ECO:0000313|RefSeq:XP_017326031.1},")
[1] "Blood"

sub(',ECO:.*', "", "Spleen,ECO:0000313|RefSeq:XP_010844217.1},ECO:0000313|RefSeq:XP_010844218.1},")
[1] "Spleen"

For class 4, this is what I tried, and it works well:

unlist(strsplit(sub(',ECO:.*},', ",", "Brain,ECO:0000313|RefSeq:XP_014030244.1},Head kidney,Muscle,Spleen,White muscle,"), ","))
[1] "Brain"        "Head kidney"  "Muscle"       "Spleen"       "White muscle"

I am looking for a solution, if possible a single regular expression, that handles all of these cases and can be applied directly to the vector.

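We could remove the identifier substrings and then use strsplit:

library(stringr)
# remove every ECO...} identifier, split on commas, drop the empty pieces
lapply(strsplit(str_remove_all(tissue, "ECO:[^\\}]+\\}"), ","), 
     function(x) x[nzchar(x)])

-output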

[[1]]
[1] "Head kidney" "Thymus"      "Thyroid"    

[[2]]
[1] "Red blood cell"

[[3]]
[1] "Muscle"

[[4]]
[1] "Leaf"

[[5]]
[1] "Head kidney"  "Muscle"       "White muscle"

[[6]]
[1] "Blood"

[[7]]
[1] "Spleen"

[[8]]
[1] "Brain"        "Head kidney"  "Muscle"       "Spleen"       "White muscle"
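
If a single pattern is preferred over removing and then splitting, the terms can probably also be extracted directly. The following is a minimal base R sketch, not part of the answer above. It assumes every identifier starts with ECO: and that no term contains a comma; the names pattern and terms are illustrative only.

```r
# Same example vector as in the question
tissue <- c("Head kidney,Thymus,Thyroid,",
            "Red blood cell,",
            "ECO:0000313|RefSeq:XP_014046664.1},Muscle,",
            "ECO:0000313|RefSeq:XP_016683349.1},ECO:0000313|RefSeq:XP_016683354.1},Leaf,",
            "ECO:0000313|RefSeq:XP_014023833.1},Head kidney,Muscle,White muscle,",
            "Blood,ECO:0000313|RefSeq:XP_017326031.1},",
            "Spleen,ECO:0000313|RefSeq:XP_010844217.1},ECO:0000313|RefSeq:XP_010844218.1},",
            "Brain,ECO:0000313|RefSeq:XP_014030244.1},Head kidney,Muscle,Spleen,White muscle,")

# (?<![^,])  the piece starts at the beginning of the string or right after a comma
# (?!ECO:)   the piece is not an identifier (assumption: identifiers start with "ECO:")
# [^,]+      the term itself, up to the next comma
pattern <- "(?<![^,])(?!ECO:)[^,]+"

# gregexpr() finds all matches per element; regmatches() extracts them as a list
terms <- regmatches(tissue, gregexpr(pattern, tissue, perl = TRUE))
terms[[8]]
```

The result is a list with one character vector per element of tissue, the same shape as the output above. The lookarounds need perl = TRUE, since the default regex engine in base R does not support them.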

Or use a tidyverse workflow:

library(dplyr)
library(tidyr)
library(stringr)
str_remove_all(tissue, "ECO:[^\\}]+\\}") %>% 
  trimws(whitespace = ",+") %>%
  str_replace_all(',{2,}', ",") %>% 
  tibble(col1 = .) %>% 
  tidyr::separate(col1, into = str_c('V', 
    seq(max(str_count(.$col1, ",")) + 1)), sep = ",", fill = "right")

-output

# A tibble: 8 × 5
  V1             V2          V3           V4     V5          
  <chr>          <chr>       <chr>        <chr>  <chr>       
1 Head kidney    Thymus      Thyroid      <NA>   <NA>        
2 Red blood cell <NA>        <NA>         <NA>   <NA>        
3 Muscle         <NA>        <NA>         <NA>   <NA>        
4 Leaf           <NA>        <NA>         <NA>   <NA>        
5 Head kidney    Muscle      White muscle <NA>   <NA>        
6 Blood          <NA>        <NA>         <NA>   <NA>        
7 Spleen         <NA>        <NA>         <NA>   <NA>        
8 Brain          Head kidney Muscle       Spleen White muscle
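
The wide table pads shorter rows with NA. If one row per term is more convenient, a long layout may be an option. This is a rough base R sketch under the same assumption that identifiers look like ECO:...}; the names terms, long, id and term are made up for illustration.

```r
# Same example vector as in the question
tissue <- c("Head kidney,Thymus,Thyroid,",
            "Red blood cell,",
            "ECO:0000313|RefSeq:XP_014046664.1},Muscle,",
            "ECO:0000313|RefSeq:XP_016683349.1},ECO:0000313|RefSeq:XP_016683354.1},Leaf,",
            "ECO:0000313|RefSeq:XP_014023833.1},Head kidney,Muscle,White muscle,",
            "Blood,ECO:0000313|RefSeq:XP_017326031.1},",
            "Spleen,ECO:0000313|RefSeq:XP_010844217.1},ECO:0000313|RefSeq:XP_010844218.1},",
            "Brain,ECO:0000313|RefSeq:XP_014030244.1},Head kidney,Muscle,Spleen,White muscle,")

# Drop the identifiers, split on commas, and discard the empty pieces
terms <- lapply(strsplit(gsub("ECO:[^}]+\\}", "", tissue), ","),
                function(x) x[nzchar(x)])

# One row per (element, term) pair; id is the position of the string in tissue
long <- data.frame(id   = rep(seq_along(terms), lengths(terms)),
                   term = unlist(terms),
                   stringsAsFactors = FALSE)
head(long, 4)
```

Because id records which input string each term came from, the long table can be filtered or counted by term without any NA padding.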

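Or using only base R:

cleaned <- gsub(",{2,}", ",", trimws(gsub("ECO:[^\\}]+\\}", "", tissue), whitespace = ",+"))
# col.names is set because read.csv() sizes the table from the first five lines only
read.csv(text = cleaned, header = FALSE, fill = TRUE, sep = ",",
    col.names = paste0("V", seq_len(max(lengths(strsplit(cleaned, ","))))))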

How about this, which pools the unique terms across the whole vector:

library(stringr)

x <- str_remove(unlist(str_match_all(tissue, '(.*?)(?=,)')), '^ECO.*')
unique(x[x != ""])
 [1] "Head kidney"    "Thymus"         "Thyroid"        "Red blood cell"
 [5] "Muscle"         "Leaf"           "White muscle"   "Blood"         
 [9] "Spleen"         "Brain"
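
The result above is pooled over all strings. If the grouping by input string should be kept, one possible variant is to split on commas first and then filter the pieces. This is only a sketch in base R; it assumes that identifiers always begin with ECO:, and the names pieces, terms and all_terms are illustrative.

```r
# Same example vector as in the question
tissue <- c("Head kidney,Thymus,Thyroid,",
            "Red blood cell,",
            "ECO:0000313|RefSeq:XP_014046664.1},Muscle,",
            "ECO:0000313|RefSeq:XP_016683349.1},ECO:0000313|RefSeq:XP_016683354.1},Leaf,",
            "ECO:0000313|RefSeq:XP_014023833.1},Head kidney,Muscle,White muscle,",
            "Blood,ECO:0000313|RefSeq:XP_017326031.1},",
            "Spleen,ECO:0000313|RefSeq:XP_010844217.1},ECO:0000313|RefSeq:XP_010844218.1},",
            "Brain,ECO:0000313|RefSeq:XP_014030244.1},Head kidney,Muscle,Spleen,White muscle,")

# Split every string on literal commas
pieces <- strsplit(tissue, ",", fixed = TRUE)

# Keep only non-empty pieces that are not identifiers (assumed to start with "ECO:")
terms <- lapply(pieces, function(x) x[nzchar(x) & !startsWith(x, "ECO:")])

# Pooling the per-string results gives the vector-wide unique terms again
all_terms <- unique(unlist(terms))
all_terms
```

No regular expression is involved here, so there is nothing to escape; the cost is that the filter relies on the ECO: prefix rather than on the closing }.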