data.table - left outer join on multiple tables
Suppose you have data like this:
fruits <- data.table(FruitID=c(1,2,3), Fruit=c("Apple", "Banana", "Strawberry"))
colors <- data.table(ColorID=c(1,2,3,4,5), FruitID=c(1,1,1,2,3), Color=c("Red","Yellow","Green","Yellow","Red"))
tastes <- data.table(TasteID=c(1,2,3), FruitID=c(1,1,3), Taste=c("Sweeet", "Sour", "Sweet"))
setkey(fruits, "FruitID")
setkey(colors, "ColorID")
setkey(tastes, "TasteID")
fruits
FruitID Fruit
1: 1 Apple
2: 2 Banana
3: 3 Strawberry
colors
ColorID FruitID Color
1: 1 1 Red
2: 2 1 Yellow
3: 3 1 Green
4: 4 2 Yellow
5: 5 3 Red
tastes
TasteID FruitID Taste
1: 1 1 Sweeet
2: 2 1 Sour
3: 3 3 Sweet
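I often need to perform left outer joins on data like this. For example, "give me all fruits and their colors" requires me to write the following (maybe there is a better way?):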
setkey(colors, "FruitID")
result <- colors[fruits, allow.cartesian=TRUE]
setkey(colors, "ColorID")
Three lines of code seems excessive for such a simple and frequent task, so I wrote a function, myLeftJoin:
myLeftJoin <- function(tbl1, tbl2){
# Performs a left join using the key in tbl1 (i.e. keeps all rows from tbl1 and only matching rows from tbl2)
oldkey <- key(tbl2)
setkeyv(tbl2, key(tbl1))
result <- tbl2[tbl1, allow.cartesian=TRUE]
setkeyv(tbl2, oldkey)
return(result)
}
which I can use like this:
myLeftJoin(fruits, colors)
ColorID FruitID Color Fruit
1: 1 1 Red Apple
2: 2 1 Yellow Apple
3: 3 1 Green Apple
4: 4 2 Yellow Banana
5: 5 3 Red Strawberry
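How can I extend this function so that I can pass any number of tables to it and get the chained left outer join of all of them? Something like myLeftJoin(tbl1, ...)
For example, I would like the result of myLeftJoin(fruits, colors, tastes) to be equivalent to: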
setkey(colors, "FruitID")
setkey(tastes, "FruitID")
result <- tastes[colors[fruits, allow.cartesian=TRUE], allow.cartesian=TRUE]
setkey(tastes, "TasteID")
setkey(colors, "ColorID")
result
TasteID FruitID Taste ColorID Color Fruit
1: 1 1 Sweeet 1 Red Apple
2: 2 1 Sour 1 Red Apple
3: 1 1 Sweeet 2 Yellow Apple
4: 2 1 Sour 2 Yellow Apple
5: 1 1 Sweeet 3 Green Apple
6: 2 1 Sour 3 Green Apple
7: NA 2 NA 4 Yellow Banana
8: 3 3 Sweet 5 Red Strawberry
Perhaps I have missed an elegant solution using functions from the data.table package? Thanks.
(Edit: fixed an error in my data)
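You can use base R's Reduce to apply left_join (from dplyr) across a list of data.table objects in one go, given that you are joining tables with common column names and would like to avoid setting keys multiple times on the data.table objects: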
library(data.table) # <= v1.9.4
library(dplyr) # left_join
Reduce(function(...) left_join(...), list(fruits,colors,tastes))
# Source: local data table [8 x 6]
# FruitID Fruit ColorID Color TasteID Taste
#1 1 Apple 1 Red 1 Sweeet
#2 1 Apple 1 Red 2 Sour
#3 1 Apple 2 Yellow 1 Sweeet
#4 1 Apple 2 Yellow 2 Sour
#5 1 Apple 3 Green 1 Sweeet
#6 1 Apple 3 Green 2 Sour
#7 2 Banana 4 Yellow NA NA
#8 3 Strawberry 5 Red 3 Sweet
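If the dplyr dependency is unwanted, the same Reduce pattern can probably be written with data.table's own merge method instead. This is a swapped-in technique, not what the answer above uses; it is a minimal sketch that assumes every table shares the FruitID column, and the name `merged` is only illustrative.

```r
library(data.table)

fruits <- data.table(FruitID = c(1, 2, 3), Fruit = c("Apple", "Banana", "Strawberry"))
colors <- data.table(ColorID = c(1, 2, 3, 4, 5), FruitID = c(1, 1, 1, 2, 3),
                     Color = c("Red", "Yellow", "Green", "Yellow", "Red"))
tastes <- data.table(TasteID = c(1, 2, 3), FruitID = c(1, 1, 3),
                     Taste = c("Sweeet", "Sour", "Sweet"))

# all.x = TRUE keeps every row of the left-hand table (left outer join);
# allow.cartesian = TRUE permits the one-to-many matches on FruitID.
merged <- Reduce(
  function(x, y) merge(x, y, by = "FruitID", all.x = TRUE, allow.cartesian = TRUE),
  list(fruits, colors, tastes)
)
merged
```

Because the join column is named explicitly through `by`, no keys need to be set and the keys of the input tables are left untouched.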
Another option is the pure data.table approach mentioned by @Frank (note that this requires the key of every data.table object to be set to FruitID):
library(data.table) # <= v1.9.4
Reduce(function(x,y) y[x, allow.cartesian=TRUE], list(fruits,colors,tastes))
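To make the key requirement concrete, here is one possible way to satisfy it without disturbing the keys from the question: key copies of the tables by FruitID and reduce over those. This is only a sketch; `join_on_key` and `keyed` are hypothetical names, and re-keying the intermediate result inside the helper is a defensive choice of mine rather than something the approach above prescribes.

```r
library(data.table)

fruits <- data.table(FruitID = c(1, 2, 3), Fruit = c("Apple", "Banana", "Strawberry"))
colors <- data.table(ColorID = c(1, 2, 3, 4, 5), FruitID = c(1, 1, 1, 2, 3),
                     Color = c("Red", "Yellow", "Green", "Yellow", "Red"))
tastes <- data.table(TasteID = c(1, 2, 3), FruitID = c(1, 1, 3),
                     Taste = c("Sweeet", "Sour", "Sweet"))
setkey(fruits, "FruitID")
setkey(colors, "ColorID")  # original keys, as in the question
setkey(tastes, "TasteID")

# Hypothetical helper: make sure the accumulated table is keyed by FruitID,
# then join it into the next table (which must also be keyed by FruitID).
join_on_key <- function(x, y) {
  setkeyv(x, "FruitID")
  y[x, allow.cartesian = TRUE]
}

# Key copies, so the originals keep their ColorID / TasteID keys.
keyed <- lapply(list(fruits, colors, tastes),
                function(dt) setkeyv(copy(dt), "FruitID"))

result <- Reduce(join_on_key, keyed)
result
```

Copying costs memory on large tables, so for big data it may be preferable to set and restore keys in place as myLeftJoin does.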
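I have just committed a new feature in data.table v1.9.5 that lets us join without setting keys (that is, specify the columns to join on directly, without having to call setkey() first).
With this, it is simply: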
require(data.table) # v1.9.5+
fruits[tastes, on="FruitID"][colors, on="FruitID"] # no setkey required
# FruitID Fruit TasteID Taste ColorID Color
# 1: 1 Apple 1 Sweeet 1 Red
# 2: 1 Apple 2 Sour 1 Red
# 3: 1 Apple 1 Sweeet 2 Yellow
# 4: 1 Apple 2 Sour 2 Yellow
# 5: 1 Apple 1 Sweeet 3 Green
# 6: 1 Apple 2 Sour 3 Green
# 7: 2 NA NA NA 4 Yellow
# 8: 3 Strawberry 3 Sweet 5 Red
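Note that in the output above the FruitID 2 row has Fruit NA: the chain starts from fruits[tastes, ...], which drops Banana because it has no taste, so it is not quite the left join from fruits that the question asks for. A variadic helper along the lines of the requested myLeftJoin(tbl1, ...) could be sketched with `on=` and Reduce as below. The name `leftJoinAll` and its `by_cols` parameter are hypothetical, not part of data.table, and the sketch assumes every table contains the join column(s).

```r
library(data.table)  # on= requires v1.9.5 or later

fruits <- data.table(FruitID = c(1, 2, 3), Fruit = c("Apple", "Banana", "Strawberry"))
colors <- data.table(ColorID = c(1, 2, 3, 4, 5), FruitID = c(1, 1, 1, 2, 3),
                     Color = c("Red", "Yellow", "Green", "Yellow", "Red"))
tastes <- data.table(TasteID = c(1, 2, 3), FruitID = c(1, 1, 3),
                     Taste = c("Sweeet", "Sour", "Sweet"))

# Hypothetical helper: chained left outer join that keeps every row of `first`.
# At each step the accumulated result `acc` goes in i, so all of its rows survive.
leftJoinAll <- function(first, ..., by_cols) {
  Reduce(function(acc, tbl) tbl[acc, on = by_cols, allow.cartesian = TRUE],
         list(first, ...))
}

result <- leftJoinAll(fruits, colors, tastes, by_cols = "FruitID")
result
```

Since no keys are set, the input tables are not modified, and Banana keeps its name while its taste columns are NA.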