Turn a nested list of (taxize) taxonomy data into a data frame
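I have some biological (microbiome) data with a bunch of OTUs whose names vary in taxonomic resolution, anywhere between genus and phylum level. I am trying to get a table of the full taxonomy for each name, i.e. every rank from phylum down to the name I've given.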
testnames <- c("Prevotella", "Bacteroides", "Enterobacteriaceae")
I've found taxize to be a useful package for pulling out the information I'm looking for.
library("taxize")
reclass <- classification(testnames, db = 'ncbi')
This gives me a list of data frames, which can be entered into R like this:
structure(list(Prevotella = structure(list(name = c("cellular organisms",
"Bacteria", "FCB group", "Bacteroidetes/Chlorobi group", "Bacteroidetes",
"Bacteroidia", "Bacteroidales", "Prevotellaceae", "Prevotella"
), rank = c("no rank", "superkingdom", "no rank", "no rank",
"phylum", "class", "order", "family", "genus"), id = c("131567",
"2", "1783270", "68336", "976", "200643", "171549", "171552",
"838")), .Names = c("name", "rank", "id"), row.names = c(NA,
-9L), class = "data.frame"), Bacteroides = structure(list(name = c("cellular organisms",
"Bacteria", "FCB group", "Bacteroidetes/Chlorobi group", "Bacteroidetes",
"Bacteroidia", "Bacteroidales", "Bacteroidaceae", "Bacteroides"
), rank = c("no rank", "superkingdom", "no rank", "no rank",
"phylum", "class", "order", "family", "genus"), id = c("131567",
"2", "1783270", "68336", "976", "200643", "171549", "815", "816"
)), .Names = c("name", "rank", "id"), row.names = c(NA, -9L), class = "data.frame"),
Enterobacteriaceae = structure(list(name = c("cellular organisms",
"Bacteria", "Proteobacteria", "Gammaproteobacteria", "Enterobacterales",
"Enterobacteriaceae"), rank = c("no rank", "superkingdom",
"phylum", "class", "order", "family"), id = c("131567", "2",
"1224", "1236", "91347", "543")), .Names = c("name", "rank",
"id"), row.names = c(NA, -6L), class = "data.frame")), .Names = c("Prevotella",
"Bacteroides", "Enterobacteriaceae"))
I'd really like to get this into a single data frame that I can import into phyloseq as a taxonomy table, e.g. something that looks like:
name Phylum Class Order Family Genus
Prevotella Bacteroidetes Bacteroidia Bacteroidales Prevotellaceae Prevotella
Bacteroides Bacteroidetes Bacteroidia Bacteroidales Bacteroidaceae Bacteroides
Enterobacteriaceae Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae
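Of course, one way to do this would be to write a loop that goes through each element of the list, finds the row whose rank is phylum, and puts it into a new data frame. That said, I feel there should be a faster way to apply this transformation, using something like plyr or dplyr.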
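I've seen something that looks close:
Converting nested list to dataframe
but it seems to assume that there is less data you don't want to keep, and that the data frame in each element is the same size. Any suggestions?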
Using dplyr and tidyr:
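library(dplyr)
library(tidyr)

# `reclass` is the named list of data frames from the question.
# unclass() is harmless on a plain list and drops any S3 class on the list.
tibble(names = names(reclass), data = unname(unclass(reclass))) %>%
  unnest(cols = data) %>%   # one row per taxon/rank pair
  filter(rank %in% c("phylum", "class", "order", "family", "genus")) %>%
  select(-id) %>%
  spread(rank, name) %>%    # ranks become columns, filled from `name`
  select(name = names, phylum, class, order, family, genus)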
# A tibble: 3 × 6
name phylum class order family genus
* <chr> <chr> <chr> <chr> <chr> <chr>
1 Bacteroides Bacteroidetes Bacteroidia Bacteroidales Bacteroidaceae Bacteroides
2 Enterobacteriaceae Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae <NA>
3 Prevotella Bacteroidetes Bacteroidia Bacteroidales Prevotellaceae Prevotella
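What this does:
- Creates a tibble from the list names and the nested data frames
- Unnests the list column
- Filters the rank column to the values you want
- Drops the id column
- Spreads the rank rows into columns, filling them with the values from name
- Selects the columns in the order you want, renaming names to name.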
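Since `spread()` is superseded in current tidyr, the same steps can also be sketched with `bind_rows()` and `pivot_wider()`. This is only an illustrative variant, not part of the original answer: the `reclass` list below is a cut-down stand-in for the one in the question (two taxa, fewer ranks), and the names `ranks`, `query` and `taxtable` are made up for the example.

```r
library(dplyr)
library(tidyr)

# Cut-down stand-in for the `reclass` list from the question (assumption:
# a plain named list of data frames with name, rank and id columns).
reclass <- list(
  Prevotella = data.frame(
    name = c("Bacteria", "Bacteroidetes", "Bacteroidia", "Bacteroidales",
             "Prevotellaceae", "Prevotella"),
    rank = c("superkingdom", "phylum", "class", "order", "family", "genus"),
    id   = c("2", "976", "200643", "171549", "171552", "838"),
    stringsAsFactors = FALSE
  ),
  Enterobacteriaceae = data.frame(
    name = c("Bacteria", "Proteobacteria", "Gammaproteobacteria",
             "Enterobacterales", "Enterobacteriaceae"),
    rank = c("superkingdom", "phylum", "class", "order", "family"),
    id   = c("2", "1224", "1236", "91347", "543"),
    stringsAsFactors = FALSE
  )
)

# Ranks to keep, in the column order wanted in the result.
ranks <- c("phylum", "class", "order", "family", "genus")

taxtable <- bind_rows(reclass, .id = "query") %>%  # stack, list names go to `query`
  filter(rank %in% ranks) %>%                      # keep only the wanted ranks
  select(-id) %>%                                  # drop the NCBI ids
  pivot_wider(names_from = rank, values_from = name) %>%  # ranks become columns
  select(name = query, any_of(ranks))              # order columns, rename `query`

taxtable
```

Taxa that stop above genus level, such as Enterobacteriaceae here, get `NA` in the missing columns. `any_of()` is used so that a rank absent from every taxon is skipped rather than raising an error.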