Add column to table with data from another table
I have a table that looks like this:
Table1 <- data.frame(
  "Random" = c("A", "B", "C"),
  "Genes" = c("Apple", "Candy", "Toothpaste"),
  "Extra" = c("Up", "", "Down"),
  "Desc" = c("Healthy,Red,Fruit", "Sweet,Cavities,Sugar,Fruity", "Minty,Dentist")
)
which gives:
  Random      Genes Extra                        Desc
1      A      Apple    Up           Healthy,Red,Fruit
2      B      Candy       Sweet,Cavities,Sugar,Fruity
3      C Toothpaste  Down               Minty,Dentist
I have another table with descriptions and want to add a column containing the genes. For example, Table2 would be:
Table2 <- data.frame(
  "Col1" = c(1, 2, 3, 4, 5, 6),
  "Desc" = c("Sweet", "Sugar", "Dentist", "Red", "Fruit", "Fruity")
)
which gives:
Col1 Desc
1 1 Sweet
2 2 Sugar
3 3 Dentist
4 4 Red
5 5 Fruit
6 6 Fruity
I want to add another column to Table2, named "Gene", that matches each "Desc" against the "Desc" column of Table1 and fills in the corresponding gene from Table1, to get:
Col1 Desc Gene
1 1 Sweet Candy
2 2 Sugar Candy
3 3 Dentist Toothpaste
4 4 Red Apple
5 5 Fruit Apple
6 6 Fruity Candy
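You could try cSplit from splitstackshape to split the 'Desc' column in "Table1" and convert the dataset from 'wide' to 'long' format. The output will be a data.table. We can then use data.table methods to set the key column to 'Desc' (setkey), join with "Table2", and finally drop the columns that are not needed in the output, either by selecting columns or by assigning (:=) the unwanted columns to NULL.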
library(splitstackshape)
setkey(cSplit(Table1, 'Desc', ',', 'long'),Desc)[Table2[2:1]][
,c(5,4,2), with=FALSE]
# Col1 Desc Genes
#1: 1 Sweet Candy
#2: 2 Sugar Candy
#3: 3 Dentist Toothpaste
#4: 4 Red Apple
#5: 5 Fruit Apple
#6: 6 Fruity Candy
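If installing splitstackshape is not an option, the same split, reshape and join steps can be sketched in base R. This is only an illustration of the idea, not how cSplit is implemented; the names parts, long and res are made up for the example, and the data is the question's data with strings kept as character.

```r
# Example data from the question (strings kept as character, not factor)
Table1 <- data.frame(
  Random = c("A", "B", "C"),
  Genes  = c("Apple", "Candy", "Toothpaste"),
  Extra  = c("Up", "", "Down"),
  Desc   = c("Healthy,Red,Fruit", "Sweet,Cavities,Sugar,Fruity", "Minty,Dentist"),
  stringsAsFactors = FALSE
)
Table2 <- data.frame(
  Col1 = c(1, 2, 3, 4, 5, 6),
  Desc = c("Sweet", "Sugar", "Dentist", "Red", "Fruit", "Fruity"),
  stringsAsFactors = FALSE
)

# Step 1: split each comma-separated Desc string into its terms
parts <- strsplit(Table1$Desc, ",", fixed = TRUE)

# Step 2: "long" format, one row per (Genes, Desc) pair
long <- data.frame(
  Genes = rep(Table1$Genes, lengths(parts)),
  Desc  = unlist(parts),
  stringsAsFactors = FALSE
)

# Step 3: left join onto Table2, then restore the original row order
res <- merge(Table2, long, by = "Desc", all.x = TRUE)
res <- res[order(res$Col1), c("Col1", "Desc", "Genes")]
rownames(res) <- NULL
res
```

merge() sorts by the join column, which is why the result is reordered by Col1 at the end.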
Here's an approach in base R using an intermediate linking table:
# create an intermediate data.frame with all the key (Desc) / value (Gene) pairs
df <- NULL
for(i in seq(nrow(Table1)))
  df <- rbind(df,
              data.frame(Gene = Table1$Genes[i],
                         Desc = strsplit(as.character(Table1$Desc)[i], ',')[[1]]))
df
#> Gene Desc
#> 1 Apple Healthy
#> 2 Apple Red
#> 3 Apple Fruit
#> 4 Candy Sweet
#> 5 Candy Cavities
#> 6 Candy Sugar
#> 7 Candy Fruity
#> 8 Toothpaste Minty
#> 9 Toothpaste Dentist
Now the link can be made in the usual way:
Table2$Gene <- df$Gene[match(Table2$Desc,df$Desc)]
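One thing to keep in mind with this lookup: match() returns the position of the first hit only, and NA when there is none. A term listed under two genes therefore silently gets the first gene, and an unknown term gets NA. Here is a small sketch of that behaviour. The table lookup, the vector terms and the duplicated "Fruit" row are all invented for illustration and are not part of the question's data.

```r
# Hypothetical linking table in which "Fruit" is deliberately listed twice
lookup <- data.frame(
  Gene = c("Apple", "Candy", "Candy"),
  Desc = c("Fruit", "Sweet", "Fruit"),
  stringsAsFactors = FALSE
)
terms <- c("Sweet", "Fruit", "Salty")

# First match wins; terms without a match become NA
genes <- lookup$Gene[match(terms, lookup$Desc)]
genes

# Terms that appear under more than one gene, worth checking before relying on match()
ambiguous <- unique(lookup$Desc[duplicated(lookup$Desc)])
ambiguous
```

If ambiguous is non-empty, a join that keeps all matches (for example merge()) may be the safer choice.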
Assuming each string is unique (i.e., Fruit cannot appear under more than one gene), you can do this fairly easily with a for loop and grep. It may be slow on a huge dataset, however.
options(stringsAsFactors = FALSE)
Table1 <- data.frame("Random" = c("A", "B", "C"), "Genes" = c("Apple", "Candy", "Toothpaste"), "Extra" = c("Up", "", "Down"), "Desc" = c("Healthy,Red,Fruit", "Sweet,Cavities,Sugar,Fruity", "Minty,Dentist"))
Table2 <- data.frame("Col1" = c(1, 2, 3, 4, 5, 6), "Desc" = c("Sweet", "Sugar", "Dentist", "Red", "Fruit", "Fruity"))
Table2$Gene <- NA
for(x in seq_len(nrow(Table2))) {
  # "\\b" is a regex word boundary, so "Fruit" does not match inside "Fruity"
  Table2[x,"Gene"] <- Table1$Genes[grep(pattern = paste("\\b",Table2$Desc[x],"\\b",sep=""),x = Table1$Desc)]
}
Table2
Col1 Desc Gene
1 1 Sweet Candy
2 2 Sugar Candy
3 3 Dentist Toothpaste
4 4 Red Apple
5 5 Fruit Apple
6 6 Fruity Candy
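The loop above relies on grep returning exactly one row for every term. If a term is missing from Table1, or listed under several genes, the assignment would fail or pick up more than one value. A more defensive variant might look like the sketch below. The helper find_gene and the extra term "Salty" are hypothetical, and the sketch assumes the terms contain no regex metacharacters.

```r
Table1 <- data.frame(
  Genes = c("Apple", "Candy", "Toothpaste"),
  Desc  = c("Healthy,Red,Fruit", "Sweet,Cavities,Sugar,Fruity", "Minty,Dentist"),
  stringsAsFactors = FALSE
)
terms <- c("Sweet", "Sugar", "Dentist", "Red", "Fruit", "Fruity", "Salty")

# Hypothetical helper: return the gene only when exactly one row matches the whole word
find_gene <- function(term, table1) {
  hits <- grep(paste0("\\b", term, "\\b"), table1$Desc)
  if (length(hits) == 1) table1$Genes[hits] else NA_character_
}

genes <- vapply(terms, find_gene, character(1), table1 = Table1, USE.NAMES = FALSE)
data.frame(Desc = terms, Gene = genes, stringsAsFactors = FALSE)
```

Terms with zero or several matching rows come back as NA instead of raising an error, which makes them easy to inspect afterwards.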
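If we can get a key lookup in the form of a named list or 2 vectors (e.g., a 2-column data frame), we can use the %l% function from the qdapTools package that I maintain. First, I'll use the strsplit function to split your Table1$Desc into a named list. That's the key. We can then look up by Table2$Desc. This uses the data.table package in the back end, so it's pretty speedy: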
library(qdapTools)
key <- setNames(strsplit(as.character(Table1[["Desc"]]), "\\s*,\\s*"), Table1[["Genes"]])
## $Apple
## [1] "Healthy" "Red" "Fruit"
##
## $Candy
## [1] "Sweet" "Cavities" "Sugar" "Fruity"
##
## $Toothpaste
## [1] "Minty" "Dentist"
Table2[["Gene"]] <- Table2[["Desc"]] %l% key
## Col1 Desc Gene
## 1 1 Sweet Candy
## 2 2 Sugar Candy
## 3 3 Dentist Toothpaste
## 4 4 Red Apple
## 5 5 Fruit Apple
## 6 6 Fruity Candy
Here's a pure base R vectorized lookup that should be pretty fast as well:
x <- strsplit(as.character(Table1[["Desc"]]), "\\s*,\\s*")
key <- setNames(rep(Table1[["Genes"]], sapply(x, length)), unlist(x))
Table2[["Gene"]] <- key[match(Table2[["Desc"]], names(key))]
Following @TylerRinker's answer, I first reshape the Table1$Desc strings:
Table1a <- with(Table1,
stack(setNames(sapply(as.character(Desc),strsplit,split=","),Genes)))
names(Table1a) <- c("Desc","Genes")
Then switch to data.table:
require(data.table)
DT1 <- data.table(Table1a,key="Desc")
DT2 <- data.table(Table2,key="Desc")
Then merge and define the new column in one step:
DT2[DT1,Gene:=Genes]
# Col1 Desc Gene
# 1: 3 Dentist Toothpaste
# 2: 5 Fruit Apple
# 3: 6 Fruity Candy
# 4: 4 Red Apple
# 5: 2 Sugar Candy
# 6: 1 Sweet Candy
Assuming there aren't too many words to match, here's an option using some tidyverse functions:
library(tidyverse)
crossing(Table1, Table2) %>%
  mutate_if(is.factor, as.character) %>%
  rowwise() %>%
  filter(str_detect(Desc, Desc1)) %>%
  select(Col1, Desc = Desc1, Genes) %>%
  arrange(Col1)
# A tibble: 7 x 3
Col1 Desc Genes
<dbl> <chr> <chr>
1 1 Sweet Candy
2 2 Sugar Candy
3 3 Dentist Toothpaste
4 4 Red Apple
5 5 Fruit Apple
6 5 Fruit Candy
7 6 Fruity Candy
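Note the extra row 6 in this output: str_detect does plain substring matching, so "Fruit" is also found inside "Fruity" and gets attributed to Candy as well. The effect can be reproduced in base R with grepl, and a word-boundary pattern ("\\b") avoids it. This sketch only illustrates the matching behaviour and is not a change to the pipeline above; desc is a name made up here.

```r
desc <- c(Apple = "Healthy,Red,Fruit", Candy = "Sweet,Cavities,Sugar,Fruity")

# Plain substring match: "Fruit" is found in both strings
substring_hit <- grepl("Fruit", desc, fixed = TRUE)

# Whole-word match: only the Apple string contains the word "Fruit"
word_hit <- grepl("\\bFruit\\b", desc)

data.frame(Genes = names(desc), substring_hit, word_hit)
```

Wrapping the search term in word boundaries before matching is one way to get the six expected rows, under the assumption that the terms are plain words.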