Can tidyverse separate_rows() retain information before each delimiter?

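I want to use separate_rows() to create one row per assembly and taxonomic rank in a data frame, so that I can summarise genome length for each taxonomic rank while keeping the full lineage string up to each semicolon delimiter. I have managed this with a chain of nested if_else statements inside mutate(), and it works, but I wonder whether anyone has a more elegant solution that I could reuse in similar situations.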

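The input for a single example assembly, my current code, and its output are included below. In practice this will be run on thousands of assemblies.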

Thanks

library(tidyverse)

df <- data.frame(Assembly = 'GCA_00001', Length = 5370060, Taxonomy = 'd__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;f__Bacillaceae_G;g__Bacillus_A;s__Bacillus_A anthracis')

df %>%
  mutate(Lineage = Taxonomy) %>%
  separate_rows(Taxonomy, sep = ';') %>%
  mutate(Rank = str_remove(Taxonomy, '__.*') %>% toupper) %>%
  group_by(Taxonomy, Rank, Lineage) %>% 
  summarise(MeanLength = mean(Length),
            MedianLength = median(Length)) %>%
  mutate(Rank = ordered(Rank, levels = c('D', 'P', 'C', 'O', 'F', 'G', 'S'))) %>%
  arrange(Rank) %>%
  mutate(Lineage = if_else(Rank == 'D', str_remove(Lineage, ';p__.*'),
                           if_else(Rank == 'P', str_remove(Lineage, ';c__.*'),
                                   if_else(Rank == 'C', str_remove(Lineage, ';o__.*'),
                                           if_else(Rank == 'O', str_remove(Lineage, ';f__.*'),
                                                   if_else(Rank == 'F', str_remove(Lineage, ';g__.*'),
                                                           if_else(Rank == 'G', str_remove(Lineage, ';s__.*'), Lineage)))))))


Taxonomy                  Rank   Lineage                                                                                                     MeanLength   MedianLength
-----------------------   ----   ---------------------------------------------------------------------------------------------------------   ----------   ------------
d__Bacteria               D      d__Bacteria                                                                                                 5370060      5370060
p__Firmicutes             P      d__Bacteria;p__Firmicutes                                                                                   5370060      5370060
c__Bacilli                C      d__Bacteria;p__Firmicutes;c__Bacilli                                                                        5370060      5370060
o__Bacillales             O      d__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales                                                          5370060      5370060
f__Bacillaceae_G          F      d__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;f__Bacillaceae_G                                         5370060      5370060
g__Bacillus_A             G      d__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;f__Bacillaceae_G;g__Bacillus_A                           5370060      5370060
s__Bacillus_A anthracis   S      d__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;f__Bacillaceae_G;g__Bacillus_A;s__Bacillus_A anthracis   5370060      5370060

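You can use purrr's accumulate() function to help with this. Note that the ranks in each Taxonomy string must already be in the correct order (domain first, species last) for this to work. Rough code below: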

library(tidyverse)

df <- data.frame(Assembly = 'GCA_00001', Length = 5370060, Taxonomy = 'd__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;f__Bacillaceae_G;g__Bacillus_A;s__Bacillus_A anthracis')

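df %>%
  separate_rows(Taxonomy, sep = ";") %>%
  # Rank letter is the first character of each taxon, e.g. "d__Bacteria" -> "D"
  mutate(Rank = toupper(str_sub(Taxonomy, 1, 1))) %>%
  # Accumulate within each assembly so a lineage never runs into the next assembly
  group_by(Assembly) %>%
  # Join each taxon onto the lineage built so far, without a trailing ";"
  mutate(Lineage = accumulate(Taxonomy, ~ paste(.x, .y, sep = ";"))) %>%
  ungroup()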
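
To show how this might slot into the summary step from the question, here is a minimal sketch that runs the accumulate() approach over two assemblies and then computes the mean and median length per lineage. The second assembly (GCA_00002), its length, and its species name are made up purely for illustration, and accumulating within each Assembly group is my assumption about how to handle more than one assembly.

```r
library(tidyverse)

# Hypothetical input: GCA_00002, its length and its species name are invented for this example
df <- tibble(
  Assembly = c("GCA_00001", "GCA_00002"),
  Length   = c(5370060, 5000000),
  Taxonomy = c(
    "d__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;f__Bacillaceae_G;g__Bacillus_A;s__Bacillus_A anthracis",
    "d__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;f__Bacillaceae_G;g__Bacillus_A;s__Bacillus_A sp_example"
  )
)

result <- df %>%
  separate_rows(Taxonomy, sep = ";") %>%
  # Ordered factor so that arrange() sorts from domain down to species
  mutate(Rank = factor(toupper(str_sub(Taxonomy, 1, 1)),
                       levels = c("D", "P", "C", "O", "F", "G", "S"),
                       ordered = TRUE)) %>%
  # Build the cumulative lineage separately for each assembly
  group_by(Assembly) %>%
  mutate(Lineage = accumulate(Taxonomy, ~ paste(.x, .y, sep = ";"))) %>%
  # Summarise genome length across all assemblies that share a lineage
  group_by(Rank, Taxonomy, Lineage) %>%
  summarise(MeanLength = mean(Length),
            MedianLength = median(Length),
            .groups = "drop") %>%
  arrange(Rank)

print(result)
```

Because both example assemblies share every rank down to genus, ranks D through G each collapse to a single row averaged over both genomes, while the two species keep separate rows. This relies on separate_rows() preserving the original order of the taxa within each string.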