Add column to table with data from another table

Question

I have a table that looks like this:

Table1 <- data.frame(
    "Random" = c("A", "B", "C"), 
    "Genes" = c("Apple", "Candy", "Toothpaste"), 
    "Extra" = c("Up", "", "Down"), 
    "Desc" = c("Healthy,Red,Fruit", "Sweet,Cavities,Sugar,Fruity", "Minty,Dentist")
)

which gives:

  Random      Genes Extra                       Desc
1      A      Apple    Up          Healthy,Red,Fruit
2      B      Candy       Sweet,Cavities,Sugar,Fruity
3      C Toothpaste  Down              Minty,Dentist

I have another table of descriptions and would like to add a column with the genes. For example, Table2 would be:

Table2 <- data.frame(
    "Col1" = c(1, 2, 3, 4, 5, 6), 
    "Desc" = c("Sweet", "Sugar", "Dentist", "Red", "Fruit", "Fruity")
)

which gives:

  Col1    Desc
1    1   Sweet
2    2   Sugar
3    3 Dentist
4    4     Red
5    5   Fruit
6    6  Fruity

I want to add another column to Table2 called "Gene" that matches on the "Desc" values of both tables and fills in the corresponding gene from Table1, to get:

  Col1    Desc       Gene
1    1   Sweet      Candy
2    2   Sugar      Candy
3    3 Dentist Toothpaste
4    4     Red      Apple
5    5   Fruit      Apple
6    6  Fruity      Candy

Answer 1

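You could try cSplit from splitstackshape to split the 'Desc' column in "Table1" and convert the dataset from 'wide' to 'long' format. The output is a data.table. We can then use data.table methods to set the key column to 'Desc' (setkey), join with "Table2", and finally drop the unwanted columns from the output, either by selecting the columns to keep or by assigning (:=) the unwanted columns to NULL.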

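library(splitstackshape)
# Split 'Desc' into long format (one row per description) and key the result on Desc.
# Table2[2:1] puts Desc first so it lines up with the key in the join.
# c(5,4,2) then keeps Col1, Desc and Genes by position.
setkey(cSplit(Table1, 'Desc', ',', 'long'),Desc)[Table2[2:1]][
                   ,c(5,4,2), with=FALSE]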
#  Col1    Desc      Genes
#1:    1   Sweet      Candy
#2:    2   Sugar      Candy
#3:    3 Dentist Toothpaste
#4:    4     Red      Apple
#5:    5   Fruit      Apple
#6:    6  Fruity      Candy
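
For anyone who wants the split-to-long and join steps spelled out one at a time, here is a rough base R sketch of the same idea. It swaps cSplit and the keyed data.table join for strsplit and merge, so it is not the code above, just the same shape. The names pieces, long, joined and result are mine, and it assumes the descriptions contain no stray whitespace around the commas.

```r
Table1 <- data.frame(
  Random = c("A", "B", "C"),
  Genes  = c("Apple", "Candy", "Toothpaste"),
  Extra  = c("Up", "", "Down"),
  Desc   = c("Healthy,Red,Fruit", "Sweet,Cavities,Sugar,Fruity", "Minty,Dentist"),
  stringsAsFactors = FALSE
)
Table2 <- data.frame(
  Col1 = c(1, 2, 3, 4, 5, 6),
  Desc = c("Sweet", "Sugar", "Dentist", "Red", "Fruit", "Fruity"),
  stringsAsFactors = FALSE
)

# Step 1: wide -> long, one row per (Genes, Desc) pair
pieces <- strsplit(Table1$Desc, ",", fixed = TRUE)
long <- data.frame(
  Genes = rep(Table1$Genes, lengths(pieces)),
  Desc  = unlist(pieces),
  stringsAsFactors = FALSE
)

# Step 2: join on Desc (merge does an inner join by default and sorts by the key)
joined <- merge(Table2, long, by = "Desc")

# Step 3: restore Table2's row order and pick the columns by name
result <- joined[order(joined$Col1), c("Col1", "Desc", "Genes")]
rownames(result) <- NULL
result
```

Selecting columns by name rather than by position keeps the last step working if Table1 gains or loses a column.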

Answer 2

Here is an approach in base R that uses an intermediate linking table:

# create an intermediate data.frame with all the key (Desc) / value (Gene) pairs
df  <-  NULL
for(i in seq(nrow(Table1)))
    df  <-  rbind(df,
                  data.frame(Gene =Table1$Genes[i],
                            Desc =strsplit(as.character(Table1$Desc)[i],',')[[1]]))
df 
#>         Gene     Desc
#> 1      Apple  Healthy
#> 2      Apple      Red
#> 3      Apple    Fruit
#> 4      Candy    Sweet
#> 5      Candy Cavities
#> 6      Candy    Sugar
#> 7      Candy   Fruity
#> 8 Toothpaste    Minty
#> 9 Toothpaste  Dentist

Now the link is made in the usual way:

Table2$Gene  <-  df$Gene[match(Table2$Desc,df$Desc)]
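
One thing worth knowing about a match-based lookup: match returns the position of the first hit only, and NA when there is no hit. A small sketch with a made-up linking table (the values "Salty" and the duplicated "Fruit" are hypothetical, not from the question) shows both cases:

```r
# Hypothetical linking table; "Fruit" is deliberately listed under two genes
link <- data.frame(
  Gene = c("Apple", "Apple", "Candy", "Candy"),
  Desc = c("Red", "Fruit", "Sweet", "Fruit"),
  stringsAsFactors = FALSE
)

wanted <- c("Sweet", "Salty", "Fruit")  # "Salty" has no entry in link

idx <- match(wanted, link$Desc)
idx                      # 3 NA 2  (first "Fruit" row wins)

genes <- link$Gene[idx]
genes                    # "Candy" NA "Apple"
```

So descriptions missing from the linking table end up as NA, and a description shared by several genes silently gets the first one.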

Answer 3

Assuming each string is unique (i.e. Fruit cannot appear under more than one Gene), you can do this fairly easily with a for loop and grep. It may be slow on huge data sets, though.

options(stringsAsFactors = FALSE)
Table1 <- data.frame("Random" = c("A", "B", "C"), "Genes" = c("Apple", "Candy", "Toothpaste"), "Extra" = c("Up", "", "Down"), "Desc" = c("Healthy,Red,Fruit", "Sweet,Cavities,Sugar,Fruity", "Minty,Dentist"))
Table2 <- data.frame("Col1" = c(1, 2, 3, 4, 5, 6), "Desc" = c("Sweet", "Sugar", "Dentist", "Red", "Fruit", "Fruity"))

Table2$Gene <- NA
for(x in 1:nrow(Table2)) {

    # "\\b" is the regex word boundary; a single "\b" in an R string is a backspace character
    Table2[x,"Gene"] <- Table1$Genes[grep(pattern = paste("\\b",Table2$Desc[x],"\\b",sep=""),x = Table1$Desc)]
}
Table2

  Col1    Desc       Gene
1    1   Sweet      Candy
2    2   Sugar      Candy
3    3 Dentist Toothpaste
4    4     Red      Apple
5    5   Fruit      Apple
6    6  Fruity      Candy
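
To see why the word-boundary anchors matter for this data, here is a quick check on the three Desc strings from the question. Without the anchors, "Fruit" also matches inside "Fruity":

```r
desc <- c("Healthy,Red,Fruit", "Sweet,Cavities,Sugar,Fruity", "Minty,Dentist")

loose  <- grep("Fruit", desc)          # matches rows 1 and 2 ("Fruit" is inside "Fruity")
strict <- grep("\\bFruit\\b", desc)    # matches row 1 only
fruity <- grep("\\bFruity\\b", desc)   # matches row 2 only

loose
strict
fruity
```

With the loose pattern the loop would try to assign two genes to a single cell and fail, which is why the anchored pattern is needed.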

Answer 4

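If we can do key lookups on a named list or on two vectors (e.g., a 2-column data frame), we can use the %l% function from the qdapTools package that I maintain. First I'll use strsplit to split your Table1$Desc into a named list. That is the key. We can then look up by Table2$Desc. This uses the data.table package on the back end, so it is pretty speedy: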

library(qdapTools)

key <- setNames(strsplit(as.character(Table1[["Desc"]]), "\\s*,\\s*"), Table1[["Genes"]])

## $Apple
## [1] "Healthy" "Red"     "Fruit"  
## 
## $Candy
## [1] "Sweet"    "Cavities" "Sugar"    "Fruity"  
## 
## $Toothpaste
## [1] "Minty"   "Dentist"

Table2[["Gene"]] <- Table2[["Desc"]] %l% key

##   Col1    Desc       Gene
## 1    1   Sweet      Candy
## 2    2   Sugar      Candy
## 3    3 Dentist Toothpaste
## 4    4     Red      Apple
## 5    5   Fruit      Apple
## 6    6  Fruity      Candy

Here is a pure base R vector lookup that should be pretty fast as well:

x <- strsplit(as.character(Table1[["Desc"]]), "\\s*,\\s*")
key <- setNames(rep(Table1[["Genes"]], sapply(x, length)), unlist(x))
Table2[["Gene"]] <- key[match(Table2[["Desc"]], names(key))]
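
A note on the split pattern used in this answer: splitting on a comma with optional surrounding whitespace means the lookup still works if the descriptions were typed with spaces around the commas. A small sketch with a made-up, messier string (not from the question):

```r
messy <- "Sweet , Cavities,Sugar ,  Fruity"  # hypothetical input with stray spaces

plain   <- strsplit(messy, ",", fixed = TRUE)[[1]]
trimmed <- strsplit(messy, "\\s*,\\s*")[[1]]

plain                      # pieces keep their surrounding spaces
trimmed                    # "Sweet" "Cavities" "Sugar" "Fruity"

match("Fruity", plain)     # NA, because the piece is "  Fruity"
match("Fruity", trimmed)   # 4
```

Whitespace at the very start or end of the whole string is not removed by this pattern; trimws would handle that if it comes up.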

Answer 5

Following @TylerRinker's answer, I first reformat the Table1$Desc strings:

Table1a        <- with(Table1,
                    stack(setNames(sapply(as.character(Desc),strsplit,split=","),Genes)))
names(Table1a) <- c("Desc","Genes")

Then move to data.table:

require(data.table)
DT1 <- data.table(Table1a,key="Desc")
DT2 <- data.table(Table2,key="Desc")

Then merge-n-define:

DT2[DT1,Gene:=Genes]
#    Col1    Desc       Gene
# 1:    3 Dentist Toothpaste
# 2:    5   Fruit      Apple
# 3:    6  Fruity      Candy
# 4:    4     Red      Apple
# 5:    2   Sugar      Candy
# 6:    1   Sweet      Candy
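
Note that the output above is sorted by Desc, because setting a key reorders the table. If the original row order of Table2 matters, a variant worth trying is an update join with the on argument instead of keys. This is only a sketch: link is my own name for the long linking table, typed out by hand here so the snippet is self-contained.

```r
library(data.table)

# Long linking table, one row per (Desc, Genes) pair
link <- data.table(
  Desc  = c("Healthy", "Red", "Fruit", "Sweet", "Cavities",
            "Sugar", "Fruity", "Minty", "Dentist"),
  Genes = c("Apple", "Apple", "Apple", "Candy", "Candy",
            "Candy", "Candy", "Toothpaste", "Toothpaste")
)
DT2 <- data.table(
  Col1 = 1:6,
  Desc = c("Sweet", "Sugar", "Dentist", "Red", "Fruit", "Fruity")
)

# Update join without keys: `on` names the join column and
# i.Genes refers to the Genes column of `link`. DT2 is modified in place.
DT2[link, on = "Desc", Gene := i.Genes]
DT2[]   # the trailing [] forces printing after :=
```

Because no key is set, DT2 keeps its rows in Col1 order.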

Answer 6

Assuming there are not too many words to match, here is an option using some tidyverse functions:

library(tidyverse)
crossing(Table1, Table2) %>% 
  mutate_if(is.factor, as.character) %>% 
  rowwise() %>% 
  filter(str_detect(Desc, Desc1)) %>% 
  select(Col1, Desc = Desc1, Genes) %>% 
  arrange(Col1)

# A tibble: 7 x 3
   Col1 Desc    Genes     
  <dbl> <chr>   <chr>     
1     1 Sweet   Candy     
2     2 Sugar   Candy     
3     3 Dentist Toothpaste
4     4 Red     Apple     
5     5 Fruit   Apple     
6     5 Fruit   Candy     
7     6 Fruity  Candy
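
The output above has 7 rows rather than 6 because str_detect matches substrings, so "Fruit" is also found inside Candy's "Fruity". If exact matches are wanted, one possible alternative (a sketch, not the approach above) is to split Desc into one row per description with tidyr's separate_rows and then do an ordinary left_join. The names long and result are mine.

```r
library(dplyr)
library(tidyr)

Table1 <- data.frame(
  Random = c("A", "B", "C"),
  Genes  = c("Apple", "Candy", "Toothpaste"),
  Extra  = c("Up", "", "Down"),
  Desc   = c("Healthy,Red,Fruit", "Sweet,Cavities,Sugar,Fruity", "Minty,Dentist"),
  stringsAsFactors = FALSE
)
Table2 <- data.frame(
  Col1 = c(1, 2, 3, 4, 5, 6),
  Desc = c("Sweet", "Sugar", "Dentist", "Red", "Fruit", "Fruity"),
  stringsAsFactors = FALSE
)

# One row per (Genes, Desc) pair
long <- Table1 %>%
  separate_rows(Desc, sep = ",") %>%
  select(Desc, Genes)

# Exact-match join; left_join keeps Table2's row order
result <- Table2 %>%
  left_join(long, by = "Desc")

result
```

This also avoids building the full cross product, which helps when there are many words to match.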
