Replace DrugBank IDs with drug names

Question

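I have a dataset of genes associated with drugs from DrugBank. I simply want to convert all of the DrugBank IDs into human-readable drug names. As you can see, my main problem is that some genes are associated with several, or even hundreds of, drugs, and the multiple drug IDs sit in the same comma-separated "column". In RStudio, the match and merge functions only act on the first identifier in each cell, effectively dropping the rest of that cell. I have found a way to do this manually in Excel for my best candidates, but that is not realistic for my dataset of 3000 genes.

Ideally, I would like to do something similar to "Text to Columns", but into rows, so that each row keeps all of its other values but holds only one of the multiple DrugBank IDs from the cell. I could then simply use a match function to replace them.

The DrugBank vocabulary (.csv) looks like this: [DBvocabulary.csv]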

DrugBank.ID  Common.name
DB00001      Lepirudin
DB00002      Cetuximab
DB00003      Dornase alfa
DB00004      Denileukin diftitox
DB00005      Etanercept
DB00006      Bivalirudin

My dataset (.csv) has 15 columns, but the important ones are:

[all_ph_active.csv]

Gene.Name  DrugBank.ID
F8         DB09130
TCN2       DB00200
LDLR       DB09270; DB11251; DB14003
ALB        DB00070; DB00137; DB00159; DB00162; DB00214;

Any suggestions are welcome. Thanks in advance!

Answer 1

One approach is to join the name column onto the original data frame.

I have provided a small example below:

library(tidyverse)

Translation <- tribble(~"ID", ~"Name",
                 "I001", "name1",
                 "I002", "name2",
                 "I003", "name3",
                 "I004", "name4",)


df <- tribble(~"ID",
              "I001",
              "I001",
              "I004",
              "I004",
              "I002",
              "I002",
              "I001",
              "I002",
              "I003",
              "I003",
              "I004",
              "I002",
              "I001"
              )
                  
# left_join keeps every row of df, even if an ID has no match in Translation
left_join(df, Translation, by = "ID")
#> # A tibble: 13 x 2
#>    ID    Name 
#>    <chr> <chr>
#>  1 I001  name1
#>  2 I001  name1
#>  3 I004  name4
#>  4 I004  name4
#>  5 I002  name2
#>  6 I002  name2
#>  7 I001  name1
#>  8 I002  name2
#>  9 I003  name3
#> 10 I003  name3
#> 11 I004  name4
#> 12 I002  name2
#> 13 I001  name1

Created on 2021-04-03 by the reprex package (v2.0.0)

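However, this example does not account for cells that contain several IDs. One way to handle that is to temporarily create one row per drug, as in the example below, and then collapse the names back into the original format.

One assumption I have made is that the drugs are listed in a single character string, with a semicolon followed by a space as the delimiter. Please correct me if that is wrong and I will update the code accordingly: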

library(tidyverse)

Translation <- tribble(~"ID", ~"Name",
                 "I001", "name1",
                 "I002", "name2",
                 "I003", "name3",
                 "I004", "name4",)


df <- tribble(~"ID",
              "I001",
              "I001",
              "I004; I002",
              "I004",
              "I002",
              "I002",
              "I001; I003",
              "I002",
              "I003",
              "I003",
              "I004",
              "I002; I001",
              "I001"
              )
                  
df_with_uniqueID <- df %>% 
  # Create a unique identifier for each row
  mutate(uniqueNum = row_number())

# Replace each ID string with a character vector of IDs
df_with_uniqueID$ID <- strsplit(df_with_uniqueID$ID, split = "; ")

# Give each ID its own row
unnest(df_with_uniqueID, cols = c(ID)) %>% 
  # left_join keeps every row of the data and adds the matching name
  left_join(Translation, by = "ID") %>% 
  # Collapse back to one row per original row
  nest(cols = c(ID, Name)) %>% 
  # Convert the names for each row to a single string
  mutate(names = map_chr(cols, function(x) paste(x$Name, collapse = "; "))) %>% 
  # Remove the column we no longer need
  select(-cols)
#> # A tibble: 13 x 2
#>    uniqueNum names       
#>        <int> <chr>       
#>  1         1 name1       
#>  2         2 name1       
#>  3         3 name4; name2
#>  4         4 name4       
#>  5         5 name2       
#>  6         6 name2       
#>  7         7 name1; name3
#>  8         8 name2       
#>  9         9 name3       
#> 10        10 name3       
#> 11        11 name4       
#> 12        12 name2; name1
#> 13        13 name1

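
For reference, the same idea can be sketched more compactly with the column names from the question. This is only a sketch under some assumptions: the gene names (GENE_A, GENE_B, GENE_C) and the two small tibbles are hypothetical stand-ins for all_ph_active.csv and DBvocabulary.csv, IDs are separated by semicolons with optional surrounding spaces, and a trailing semicolon may be present. It uses tidyr::separate_rows() to do the "Text to Columns, but into rows" step, then joins the collapsed names back so that every original column is kept.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for DBvocabulary.csv (IDs and names from the excerpt)
vocab <- tibble(
  DrugBank.ID = c("DB00001", "DB00002", "DB00003"),
  Common.name = c("Lepirudin", "Cetuximab", "Dornase alfa")
)

# Hypothetical stand-in for all_ph_active.csv; gene names are made up
genes <- tibble(
  Gene.Name   = c("GENE_A", "GENE_B", "GENE_C"),
  DrugBank.ID = c("DB00001", "DB00002; DB00003", "DB00003;DB00001; ")
)

# Tag each row so the names can be matched back to it later
genes_id <- genes %>% mutate(row_id = row_number())

names_per_row <- genes_id %>%
  select(row_id, DrugBank.ID) %>%
  # One row per ID
  separate_rows(DrugBank.ID, sep = ";") %>%
  # Drop stray spaces and the empty piece left by a trailing semicolon
  mutate(DrugBank.ID = trimws(DrugBank.ID)) %>%
  filter(DrugBank.ID != "") %>%
  # Look up each ID; an ID missing from vocab gets NA (pasted as "NA" below)
  left_join(vocab, by = "DrugBank.ID") %>%
  group_by(row_id) %>%
  # Collapse the names back to one string per original row
  summarise(Drug.Name = paste(Common.name, collapse = "; "), .groups = "drop")

# Attach the names to the untouched original columns
result <- genes_id %>%
  left_join(names_per_row, by = "row_id") %>%
  select(-row_id)

print(result)
```

With real data, the two tibbles would presumably be replaced by read.csv() calls on the two files. Rows whose DrugBank.ID cell is empty or NA end up with NA in Drug.Name rather than being dropped.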
