Data.table - left outer join on multiple tables

Suppose you have data like this:

fruits <- data.table(FruitID=c(1,2,3), Fruit=c("Apple", "Banana", "Strawberry"))
colors <- data.table(ColorID=c(1,2,3,4,5), FruitID=c(1,1,1,2,3), Color=c("Red","Yellow","Green","Yellow","Red"))
tastes <- data.table(TasteID=c(1,2,3), FruitID=c(1,1,3), Taste=c("Sweeet", "Sour", "Sweet"))

setkey(fruits, "FruitID")
setkey(colors, "ColorID")
setkey(tastes, "TasteID")

fruits
   FruitID      Fruit
1:       1      Apple
2:       2     Banana
3:       3 Strawberry

colors
   ColorID FruitID  Color
1:       1       1    Red
2:       2       1 Yellow
3:       3       1  Green
4:       4       2 Yellow
5:       5       3    Red

tastes
   TasteID FruitID  Taste
1:       1       1 Sweeet
2:       2       1   Sour
3:       3       3  Sweet

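I frequently need to perform left outer joins on data like this. For instance, "give me all fruits and their colors" requires me to write the following (maybe there's a better way?):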

setkey(colors, "FruitID")
result <- colors[fruits, allow.cartesian=TRUE]
setkey(colors, "ColorID")

Three lines of code seems excessive for such a simple and frequent task, so I wrote a function, myLeftJoin:

myLeftJoin <- function(tbl1, tbl2){
  # Performs a left join using the key in tbl1 (i.e. keeps all rows from tbl1 and only matching rows from tbl2)

  oldkey <- key(tbl2)
  setkeyv(tbl2, key(tbl1))
  result <- tbl2[tbl1, allow.cartesian=TRUE]
  setkeyv(tbl2, oldkey)
  return(result)
}

which I can use like this:

myLeftJoin(fruits, colors)
   ColorID FruitID  Color      Fruit
1:       1       1    Red      Apple
2:       2       1 Yellow      Apple
3:       3       1  Green      Apple
4:       4       2 Yellow     Banana
5:       5       3    Red Strawberry

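How can I extend this function so that I can pass any number of tables to it and get a chained left outer join of all of them? Something like myLeftJoin(tbl1, ...)

For instance, I'd like the result of myLeftJoin(fruits, colors, tastes) to be equivalent to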

setkey(colors, "FruitID")
setkey(tastes, "FruitID")
result <- tastes[colors[fruits, allow.cartesian=TRUE], allow.cartesian=TRUE]
setkey(tastes, "TasteID")
setkey(colors, "ColorID")

result
   TasteID FruitID  Taste ColorID  Color      Fruit
1:       1       1 Sweeet       1    Red      Apple
2:       2       1   Sour       1    Red      Apple
3:       1       1 Sweeet       2 Yellow      Apple
4:       2       1   Sour       2 Yellow      Apple
5:       1       1 Sweeet       3  Green      Apple
6:       2       1   Sour       3  Green      Apple
7:      NA       2     NA       4 Yellow     Banana
8:       3       3  Sweet       5    Red Strawberry

Perhaps there's an elegant solution using functions from the data.table package that I have missed? Thanks

(Edit: fixed a mistake in my data)

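You could use base R's Reduce with left_join (from dplyr) to join a list of data.table objects in one go, given that you are joining tables on a common column name and want to avoid setting keys multiple times on the data.table objects.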

library(data.table) # <= v1.9.4
library(dplyr) # left_join

Reduce(function(...) left_join(...), list(fruits,colors,tastes))

# Source: local data table [8 x 6]

#  FruitID      Fruit ColorID  Color TasteID  Taste
#1       1      Apple       1    Red       1 Sweeet
#2       1      Apple       1    Red       2   Sour
#3       1      Apple       2 Yellow       1 Sweeet
#4       1      Apple       2 Yellow       2   Sour
#5       1      Apple       3  Green       1 Sweeet
#6       1      Apple       3  Green       2   Sour
#7       2     Banana       4 Yellow      NA     NA
#8       3 Strawberry       5    Red       3  Sweet

Another option is the pure data.table approach mentioned by @Frank (note that this requires the key of every data.table object to be set to FruitID):

library(data.table) # <= v1.9.4
Reduce(function(x,y) y[x, allow.cartesian=TRUE], list(fruits,colors,tastes))
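
Because this keyed form relies on every table being keyed by FruitID, a fuller sketch has to set the keys first and, if the original keys matter, put them back afterwards. The example below is only an illustration built from the question's data. Restoring the keys at the end is an assumption about what the asker wants, not something the join itself requires.

```r
library(data.table)

# Data copied from the question
fruits <- data.table(FruitID = c(1, 2, 3), Fruit = c("Apple", "Banana", "Strawberry"))
colors <- data.table(ColorID = c(1, 2, 3, 4, 5), FruitID = c(1, 1, 1, 2, 3),
                     Color = c("Red", "Yellow", "Green", "Yellow", "Red"))
tastes <- data.table(TasteID = c(1, 2, 3), FruitID = c(1, 1, 3),
                     Taste = c("Sweeet", "Sour", "Sweet"))

# A keyed join matches on the key columns, so key every table by FruitID
setkey(fruits, FruitID)
setkey(colors, FruitID)
setkey(tastes, FruitID)

# y[x] keeps every row of x, so folding from the left keeps every fruit
result <- Reduce(function(x, y) y[x, allow.cartesian = TRUE],
                 list(fruits, colors, tastes))

# Assumption: the caller wants the original keys back afterwards
setkey(colors, ColorID)
setkey(tastes, TasteID)
```

The result should contain the same eight rows as the expected output in the question, including the Banana row with NA for TasteID and Taste.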

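I've just committed a new feature in data.table v1.9.5 that lets us join without setting keys (that is, specify the columns to join on directly, without having to call setkey() first).

With this, it's simply: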

require(data.table) # v1.9.5+
fruits[tastes, on="FruitID"][colors, on="FruitID"] # no setkey required
#    FruitID      Fruit TasteID  Taste ColorID  Color
# 1:       1      Apple       1 Sweeet       1    Red
# 2:       1      Apple       2   Sour       1    Red
# 3:       1      Apple       1 Sweeet       2 Yellow
# 4:       1      Apple       2   Sour       2 Yellow
# 5:       1      Apple       1 Sweeet       3  Green
# 6:       1      Apple       2   Sour       3  Green
# 7:       2         NA      NA     NA       4 Yellow
# 8:       3 Strawberry       3  Sweet       5    Red
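
Note that in the output above the Banana row has Fruit = NA: fruits is the table being looked up in the first join, so its unmatched rows are not carried through. To tie on= back to the variadic myLeftJoin(tbl1, ...) asked for in the question, here is a minimal sketch, assuming a data.table version that supports on= (v1.9.5+ as stated above) and assuming every table shares the join column. The join_cols parameter and the function body are hypothetical illustrations, not part of data.table.

```r
library(data.table)  # assumes a version that supports on= (v1.9.5+)

# Data copied from the question
fruits <- data.table(FruitID = c(1, 2, 3), Fruit = c("Apple", "Banana", "Strawberry"))
colors <- data.table(ColorID = c(1, 2, 3, 4, 5), FruitID = c(1, 1, 1, 2, 3),
                     Color = c("Red", "Yellow", "Green", "Yellow", "Red"))
tastes <- data.table(TasteID = c(1, 2, 3), FruitID = c(1, 1, 3),
                     Taste = c("Sweeet", "Sour", "Sweet"))

# Hypothetical variadic helper: keeps every row of tbl1 and chains the
# remaining tables onto it, joining on the column(s) named in join_cols.
myLeftJoin <- function(tbl1, ..., join_cols = "FruitID") {
  Reduce(
    # y[x, on = ...] keeps all rows of x (the accumulated left side)
    function(x, y) y[x, on = join_cols, allow.cartesian = TRUE],
    list(...),
    tbl1  # initial value: the table whose rows must all survive
  )
}

result <- myLeftJoin(fruits, colors, tastes)
```

Since on= does not touch keys, this version needs no setkey() calls and leaves any existing keys on the input tables as they were.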