tidyr: separate column while retaining delimiter in the first column

Question

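I have a column that I want to split in two while retaining the delimiter. I have gotten this far, but part of the delimiter is removed. I also need to split it a second way, with the delimiter attached to the first column, but I can't figure out how to do that.

library(tidyr)

duplicates <- data.frame(sample = c("a_1_b1", "a1_2_b1", "a1_c_1_b2"))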

duplicates <- separate(duplicates, 
                       sample, 
                       into = c("strain", "sample"),
                       sep = "_(?=[:digit:])")

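Taking just the first name as an example, my output is a_1 and b1, while my desired output is a_1 and _b1.

I would also like to perform this split with the delimiter attached to the first column, as shown below.

sample	batch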
a_1_	b1
a1_2_	b1
a1_c_1_	b2

Edit: This post does not answer my question, which is how to retain the delimiter or control which side of the split it ends up on.

Answer 1

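You can use tidyr::extract with capture groups. The greedy `.*_` in the first group runs up to and including the last underscore, so the delimiter stays in the first column.

tidyr::extract(duplicates, sample, c("strain", "sample"), '(.*_)(\\w+)')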

#   strain sample
#1    a_1_     b1
#2   a1_2_     b1
#3 a1_c_1_     b2

The same regex can also be used with strcapture in base R:

strcapture('(.*_)(\\w+)', duplicates$sample, 
           proto = list(strain = character(), sample = character()))
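
Which side of the split the underscore lands on depends only on which capture group contains it. The following is a minimal base R sketch of both placements, assuming the part after the last underscore never contains an underscore itself. The `split_last` helper and its `keep` argument are hypothetical names used for illustration; they are not part of tidyr or base R.

```r
# Hypothetical helper: split each string at its last underscore and keep
# the underscore on the chosen side.
split_last <- function(x, keep = c("left", "right")) {
  keep <- match.arg(keep)
  # The underscore sits inside whichever capture group should retain it.
  pattern <- if (keep == "left") "^(.*_)([^_]+)$" else "^(.*)(_[^_]+)$"
  strcapture(pattern, x,
             proto = data.frame(strain = character(), sample = character()))
}

x <- c("a_1_b1", "a1_2_b1", "a1_c_1_b2")
left  <- split_last(x, "left")   # underscore stays with strain: a_1_ | b1
right <- split_last(x, "right")  # underscore moves to sample:   a_1  | _b1
```

Here `[^_]+` is used instead of `\\w+` because `\w` also matches an underscore, which makes the second pattern easier to reason about.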

Answer 2

Update, following the OP's request in the comments:

library(dplyr)

duplicates %>% 
    mutate(batch = sub(".*_", "_", sample)) %>%  
    mutate(sample = sub("_[^_]+$", "", sample))

Output:

  sample batch
1    a_1   _b1
2   a1_2   _b1
3 a1_c_1   _b2

Update after clarification (see comments):

duplicates %>% 
    mutate(batch = sub(".*_", "", sample)) %>%  
    mutate(sample = sub("_[^_]+$", "_", sample))

Output:

   sample batch
1    a_1_    b1
2   a1_2_    b1
3 a1_c_1_    b2

First answer: we can use str_sub from the stringr package. This relies on the batch label always being exactly two characters long:

library(stringr)
library(dplyr)

duplicates %>% 
    mutate(batch = str_sub(sample, -2,-1)) %>% 
    mutate(sample = str_sub(sample, end=-3))

Output:

   sample batch
1    a_1_    b1
2   a1_2_    b1
3 a1_c_1_    b2
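
The positional approach and the regex approach agree on the question's data, but they diverge as soon as a batch label is not two characters wide. The sketch below illustrates this in base R, with `substr` standing in for `str_sub`. The value "a_1_b10" is a hypothetical input that does not appear in the question.

```r
# "a_1_b10" is a made-up value with a three-character batch label.
x <- c("a_1_b1", "a_1_b10")

# Fixed-width: always take the last two characters.
fixed_width <- substr(x, nchar(x) - 1, nchar(x))

# Regex: drop everything up to and including the last underscore.
by_regex <- sub(".*_", "", x)

fixed_width  # "b1" "10"  (the "b" of "b10" is cut off)
by_regex     # "b1" "b10"
```

If the batch labels can vary in length, the `sub()` versions in the updates above are likely the safer choice.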

Answer 3

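Using separate with a zero-width separator: the lookbehind requires an underscore just before the split point, and the lookahead requires that only non-underscore characters follow, so nothing is consumed.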

library(tidyr)
separate(duplicates, sample, into = c("strain", "sample"), 
        sep = "(?<=_)(?=[^_]+$)")

Output:

    strain sample
1    a_1_     b1
2   a1_2_     b1
3 a1_c_1_     b2

To split the other way, with the underscore kept in the second column:

separate(duplicates, sample, into = c("strain", "sample"), 
         sep = "(?<=[^_])(?=_[^_]+$)")

Output:

  strain sample
1    a_1    _b1
2   a1_2    _b1
3 a1_c_1    _b2
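
Both patterns match an empty position rather than any characters, which is why no part of the string is lost. A rough base R sketch of the same idea follows: find the zero-width match with `regexpr` and cut the string at that position. The `split_at` helper is a hypothetical name for illustration, and the sketch assumes every string contains exactly one such split point.

```r
# Hypothetical helper: locate the zero-width split point, then cut there.
split_at <- function(x, pattern) {
  pos <- as.integer(regexpr(pattern, x, perl = TRUE))  # start of the empty match
  data.frame(strain = substr(x, 1, pos - 1),
             sample = substr(x, pos, nchar(x)),
             stringsAsFactors = FALSE)
}

x <- c("a_1_b1", "a1_2_b1", "a1_c_1_b2")
keep_left  <- split_at(x, "(?<=_)(?=[^_]+$)")     # a_1_ | b1
keep_right <- split_at(x, "(?<=[^_])(?=_[^_]+$)") # a_1  | _b1
```

Moving the underscore from the lookbehind to the lookahead shifts the split point by one character, which is all that changes between the two calls.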
