How do I replace values in a data.table's column using a lookup table? [R]
I have a key and a large metadata table. One column of the metadata table contains values like these:
body_site
Lung
Lung
Brain - Amygdala
Brain - Amygdala
Brain - Caudate (basal ganglia)
Brain - Caudate (basal ganglia)
Lung
Lung
Skin - Sun Exposed (Lower leg)
Skin - Sun Exposed (Lower leg)
Brain - Spinal cord (cervical c-1)
Brain - Spinal cord (cervical c-1)
with body_site as the header. The key looks like this:
Tissue,Key
Adipose - Subcutaneous,ADPSBQ
Adipose - Visceral (Omentum),ADPVSC
Adrenal Gland,ADRNLG
Artery - Aorta,ARTAORT
Artery - Coronary,ARTACRN
Artery - Tibial,ARTTBL
Bladder,BLDDER
Brain - Amygdala,BRNAMY
Brain - Anterior cingulate cortex (BA24),BRNACC
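It is a CSV that gives the corresponding abbreviation for each tissue. What I want to do is replace every entry in that column of the first table with the matching abbreviation from the second column of the second table.
The problem is that when I follow the advice in another answer, which demonstrates how to do this, I end up with a table that has values only in the body_site column. In other words, all the other data in the table is wiped out, and only the replaced column survives. On the plus side, the replacement itself works, but I now have a table that is completely empty apart from the headers.
My code is below. It includes both solutions offered by the top answerer, and I tried each of them.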
library("data.table")
args = commandArgs(trailingOnly=TRUE)
# SraRunTable.txt is args[1]
#sratabl <- fread(args[1])
sratabl <- fread("SraRunTable.txt")
tiskey <- fread("GTExTissueKey.csv")
# current directory is args [2]
new <- sratabl # create a copy of df
# using lapply, loop over columns and match values to the look up table. store in "new".
new[] <- lapply(sratabl, function(x) tiskey$Key[match(x, tiskey$Tissue)])
new <- sratabl
new[] <- tiskey$Key[match(unlist(sratabl), tiskey$Tissue)]
The solution is as follows:
require(data.table)
df1 <- data.frame(a = c("a","b","c"), b = c("x","y","z"))
df2 <- data.frame(a = c("a","c"), b = c("new_x","new_z"))
setDT(df1)
setDT(df2)
# inspect each df
df1
# a b
# 1: a x
# 2: b y
# 3: c z
df2
# a b
# 1: a new_x
# 2: c new_z
l <- match(df1$a, df2$a, nomatch = 0)
df1$b[l != 0] <- df2$b[l]
df1
# a b
# 1: a new_x
# 2: b y
# 3: c new_z
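1. I think you are overusing lapply; since you are working on a single column of the frame, there is no need for it here.
2. There will be NAs in the result, at least with this data (and you should guard against them in any case). For that reason I suggest using an intermediate/temp variable.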
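A small sketch of the first point, using made-up toy frames (toy, key, and the run column are illustrative assumptions, not the asker's files). Running match() over every column through lapply also looks up the run IDs in the tissue list, so every column other than body_site comes back as NA, whereas assigning to the single column leaves the rest alone.

```r
# Toy stand-ins for the metadata table and the key (hypothetical data).
toy <- data.frame(body_site = c("Bladder", "Lung"),
                  run = c("r1", "r2"),
                  stringsAsFactors = FALSE)
key <- data.frame(Tissue = "Bladder", Key = "BLDDER",
                  stringsAsFactors = FALSE)

# Looping over all columns: 'run' values are not tissues, so they become NA.
wiped <- toy
wiped[] <- lapply(toy, function(x) key$Key[match(x, key$Tissue)])

# Touching only body_site keeps 'run' intact
# (sites missing from the key are still NA at this point).
single <- toy
single$body_site <- key$Key[match(toy$body_site, key$Tissue)]
```

Here wiped$run is all NA, while single$run still holds the original run IDs.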
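For #2 above, I keep the variable inside the frame so it is easy to compare against the original (and remove it afterwards). That is not required: it could just as easily be stored in a standalone vector, fixed up, and then assigned.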
df1$tmp <- df2$Key[ match(df1$body_site, df2$Tissue) ]
head(df1)
# body_site tmp
# 1 Lung <NA>
# 2 Lung <NA>
# 3 Brain - Amygdala BRNAMY
# 4 Brain - Amygdala BRNAMY
# 5 Brain - Caudate (basal ganglia) <NA>
# 6 Brain - Caudate (basal ganglia) <NA>
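Those are the NAs you need to watch out for. The next step takes the new column only where it is not NA, and otherwise keeps the original value.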
df1$tmp <- ifelse(is.na(df1$tmp), df1$body_site, df1$tmp)
head(df1)
# body_site tmp
# 1 Lung Lung
# 2 Lung Lung
# 3 Brain - Amygdala BRNAMY
# 4 Brain - Amygdala BRNAMY
# 5 Brain - Caudate (basal ganglia) Brain - Caudate (basal ganglia)
# 6 Brain - Caudate (basal ganglia) Brain - Caudate (basal ganglia)
Now, clean up:
df1$body_site <- df1$tmp
df1$tmp <- NULL
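The same steps can also be sketched with a standalone vector instead of a temporary column. The names meta, lookup, and sample_id below are hypothetical and chosen only for illustration; sample_id stands in for the other metadata columns, which are never touched.

```r
# Hypothetical miniature versions of the metadata table and the key.
meta <- data.frame(body_site = c("Lung", "Brain - Amygdala"),
                   sample_id = c("s1", "s2"),
                   stringsAsFactors = FALSE)
lookup <- data.frame(Tissue = "Brain - Amygdala", Key = "BRNAMY",
                     stringsAsFactors = FALSE)

# Look up the abbreviation for each row; rows without a match give NA.
abbrev <- lookup$Key[match(meta$body_site, lookup$Tissue)]

# Use the abbreviation where one was found, else keep the original value.
meta$body_site <- ifelse(is.na(abbrev), meta$body_site, abbrev)
```

Only body_site is assigned to, so sample_id comes through unchanged.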
Alternative: a join.
library(dplyr)
left_join(df1, df2, by=c("body_site" = "Tissue")) %>% head()
# body_site Key
# 1 Lung <NA>
# 2 Lung <NA>
# 3 Brain - Amygdala BRNAMY
# 4 Brain - Amygdala BRNAMY
# 5 Brain - Caudate (basal ganglia) <NA>
# 6 Brain - Caudate (basal ganglia) <NA>
(The same cleanup is needed.)
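One way that cleanup might look in a dplyr pipeline, as a sketch only: coalesce() takes the first non-missing value, so unmatched rows fall back to the original body_site. The frames meta and lookup and the sample_id column are hypothetical stand-ins.

```r
library(dplyr)

# Hypothetical miniature versions of the metadata table and the key.
meta <- data.frame(body_site = c("Lung", "Brain - Amygdala"),
                   sample_id = c("s1", "s2"),
                   stringsAsFactors = FALSE)
lookup <- data.frame(Tissue = "Brain - Amygdala", Key = "BRNAMY",
                     stringsAsFactors = FALSE)

cleaned <- meta %>%
  left_join(lookup, by = c("body_site" = "Tissue")) %>%
  # Prefer the abbreviation; fall back to the original where Key is NA.
  mutate(body_site = coalesce(Key, body_site)) %>%
  # Drop the helper column that the join added.
  select(-Key)
```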
library(data.table)
head( merge(df1, df2, by.x="body_site", by.y="Tissue", all.x=TRUE) )
# body_site Key
# 1: Brain - Amygdala BRNAMY
# 2: Brain - Amygdala BRNAMY
# 3: Brain - Caudate (basal ganglia) <NA>
# 4: Brain - Caudate (basal ganglia) <NA>
# 5: Brain - Spinal cord (cervical c-1) <NA>
# 6: Brain - Spinal cord (cervical c-1) <NA>
(The same cleanup is needed.)
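With data.table itself, an update join may avoid the cleanup step entirely: it assigns by reference only on the rows that find a match, so unmatched rows and all other columns are left as they were. This is a sketch on hypothetical toy tables (meta, lookup, sample_id are made-up names), not the asker's files.

```r
library(data.table)

# Hypothetical miniature versions of the metadata table and the key.
meta <- data.table(body_site = c("Lung", "Brain - Amygdala", "Bladder"),
                   sample_id = c("s1", "s2", "s3"))
lookup <- data.table(Tissue = c("Bladder", "Brain - Amygdala"),
                     Key = c("BLDDER", "BRNAMY"))

# Update join: for rows of meta whose body_site equals a Tissue in lookup,
# overwrite body_site in place with that row's Key (i.Key refers to lookup).
meta[lookup, on = .(body_site = Tissue), body_site := i.Key]
```

Because := modifies meta by reference, take a copy() first if the original table is still needed.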
Data:
df1 <- read.csv(header=TRUE, stringsAsFactors=FALSE, text='
body_site
Lung
Lung
Brain - Amygdala
Brain - Amygdala
Brain - Caudate (basal ganglia)
Brain - Caudate (basal ganglia)
Lung
Lung
Skin - Sun Exposed (Lower leg)
Skin - Sun Exposed (Lower leg)
Brain - Spinal cord (cervical c-1)
Brain - Spinal cord (cervical c-1)')
df2 <- read.csv(header=TRUE, stringsAsFactors=FALSE, text='
Tissue,Key
Adipose - Subcutaneous,ADPSBQ
Adipose - Visceral (Omentum),ADPVSC
Adrenal Gland,ADRNLG
Artery - Aorta,ARTAORT
Artery - Coronary,ARTACRN
Artery - Tibial,ARTTBL
Bladder,BLDDER
Brain - Amygdala,BRNAMY
Brain - Anterior cingulate cortex (BA24),BRNACC')