Can tidyverse separate_rows() retain information before each delimiter?
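I want to use separate_rows() to create one row per assembly and taxonomic rank in my data frame, so that I can summarise genome length for each taxonomic rank while keeping the full lineage string up to each semicolon delimiter.
I've managed this with a pile of if_else statements inside mutate(), and it works, but I'm wondering whether anyone has a more elegant solution that I could reuse in similar situations.
Below are the input for a single example assembly, my current code, and the output. In practice this will be applied to thousands of assemblies.
Thanks,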
Calculator
library(tidyverse)

df <- data.frame(Assembly = 'GCA_00001', Length = 5370060, Taxonomy = 'd__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;f__Bacillaceae_G;g__Bacillus_A;s__Bacillus_A anthracis')
df %>%
  # Keep a copy of the full string before splitting it into one row per rank
  mutate(Lineage = Taxonomy) %>%
  separate_rows(Taxonomy, sep = ';') %>%
  mutate(Rank = str_remove(Taxonomy, '__.*') %>% toupper) %>%
  group_by(Taxonomy, Rank, Lineage) %>%
  # Drop the grouping afterwards so that Rank can be modified below
  summarise(MeanLength = mean(Length),
            MedianLength = median(Length),
            .groups = 'drop') %>%
  mutate(Rank = ordered(Rank, levels = c('D', 'P', 'C', 'O', 'F', 'G', 'S'))) %>%
  arrange(Rank) %>%
  # Trim the lineage after the current rank by removing the next rank onwards
  mutate(Lineage = if_else(Rank == 'D', str_remove(Lineage, ';p__.*'),
                   if_else(Rank == 'P', str_remove(Lineage, ';c__.*'),
                   if_else(Rank == 'C', str_remove(Lineage, ';o__.*'),
                   if_else(Rank == 'O', str_remove(Lineage, ';f__.*'),
                   if_else(Rank == 'F', str_remove(Lineage, ';g__.*'),
                   if_else(Rank == 'G', str_remove(Lineage, ';s__.*'), Lineage)))))))
Taxonomy                 Rank  Lineage                                                                                                    MeanLength  MedianLength
-----------------------  ----  ---------------------------------------------------------------------------------------------------------  ----------  ------------
d__Bacteria              D     d__Bacteria                                                                                                   5370060       5370060
p__Firmicutes            P     d__Bacteria;p__Firmicutes                                                                                     5370060       5370060
c__Bacilli               C     d__Bacteria;p__Firmicutes;c__Bacilli                                                                          5370060       5370060
o__Bacillales            O     d__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales                                                            5370060       5370060
f__Bacillaceae_G         F     d__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;f__Bacillaceae_G                                           5370060       5370060
g__Bacillus_A            G     d__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;f__Bacillaceae_G;g__Bacillus_A                             5370060       5370060
s__Bacillus_A anthracis  S     d__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;f__Bacillaceae_G;g__Bacillus_A;s__Bacillus_A anthracis     5370060       5370060
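You can use the accumulate() function from purrr to help with this. Note that the ranks in each taxonomy string need to be in the correct order (domain first, species last) for this to work. Rough code below: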
library(tidyverse)

df <- data.frame(Assembly = 'GCA_00001', Length = 5370060, Taxonomy = 'd__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;f__Bacillaceae_G;g__Bacillus_A;s__Bacillus_A anthracis')

df %>%
  separate_rows(Taxonomy, sep = ";") %>%
  # Accumulate within each assembly so one lineage never runs into the next
  group_by(Assembly) %>%
  # Join each rank onto the lineage built so far, without a trailing ";"
  mutate(Rank = toupper(str_sub(Taxonomy, 1, 1)),
         Lineage = accumulate(Taxonomy, ~ paste(.x, .y, sep = ";"))) %>%
  ungroup()
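To get from there to the per-rank summary in the question, the accumulated Lineage can be used directly as a grouping key. The following is only a sketch under some assumptions: the second assembly (GCA_00002), its length, and its species name are made up for illustration, and the names rank_levels and lineage_summary are mine, not from the original post.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(stringr)

# Two assemblies; the second row is invented purely for illustration
df <- data.frame(
  Assembly = c("GCA_00001", "GCA_00002"),
  Length   = c(5370060, 5000000),
  Taxonomy = c(
    "d__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;f__Bacillaceae_G;g__Bacillus_A;s__Bacillus_A anthracis",
    "d__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;f__Bacillaceae_G;g__Bacillus_A;s__Bacillus_A cereus"
  )
)

# Rank order used for sorting (same letters as in the question)
rank_levels <- c("D", "P", "C", "O", "F", "G", "S")

lineage_summary <- df %>%
  separate_rows(Taxonomy, sep = ";") %>%
  # Build the cumulative lineage separately for each assembly
  group_by(Assembly) %>%
  mutate(Lineage = accumulate(Taxonomy, ~ paste(.x, .y, sep = ";"))) %>%
  ungroup() %>%
  # First character of each taxon is its rank prefix (d, p, c, ...)
  mutate(Rank = factor(toupper(str_sub(Taxonomy, 1, 1)),
                       levels = rank_levels, ordered = TRUE)) %>%
  # Summarise genome length across all assemblies sharing a lineage
  group_by(Rank, Taxonomy, Lineage) %>%
  summarise(MeanLength = mean(Length),
            MedianLength = median(Length),
            .groups = "drop") %>%
  arrange(Rank, Lineage)

print(lineage_summary)
```

Grouping by Lineage rather than by Taxonomy alone means that two taxa with the same name but different parents would be summarised separately, which is probably what you want with thousands of assemblies.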