dplyr full_join does not work as expected
Here is a toy example (merge is from the base package, full_join is from dplyr):
require(dplyr)
a = data.frame(Day = Sys.Date() + 1:5, x = 1:5)
b = data.frame(Day = Sys.Date() - 1:5, x = 3 * (1:5))
x1 = b
x2 = b
for (i in 1:10) {
  x1 = full_join(x1, a, by = "Day")
  x2 = merge(x2, a, by = "Day", all = T)
}
x1 and x2 are different. I expected the x2 result, because "a" is appended at the end.
Here are the first rows of x2:
2015-05-14 15 NA NA NA NA NA NA NA NA NA NA
2015-05-15 12 NA NA NA NA NA NA NA NA NA NA
2015-05-16 9 NA NA NA NA NA NA NA NA NA NA
2015-05-17 6 NA NA NA NA NA NA NA NA NA NA
But x1, from full_join, is:
Day x.x x.y x.x x.y x.x x.y x.x x.y x.x x.y x
1 2015-05-18 3 NA 3 NA 3 NA 3 NA 3 NA NA
2 2015-05-17 6 NA 6 NA 6 NA 6 NA 6 NA NA
3 2015-05-16 9 NA 9 NA 9 NA 9 NA 9 NA NA
Is this a bug, or is it expected? I would think the output of merge (x2) is the logically correct one. I want to get x2 using dplyr's full_join. Is there a way?
If you rename the column in data frame a, the two approaches behave the same:
require(dplyr)
a = data.frame(Day = Sys.Date() + 1:5, y = 1:5)
b = data.frame(Day = Sys.Date() - 1:5, x = 3 * (1:5))
x1 = b
x2 = b
for (i in 1:10) {
  x1 = full_join(x1, a, by = "Day")
  x2 = merge(x2, a, by = "Day", all = T)
}
# fix up the column names...
names(x1) <- paste0("V", seq_len(ncol(x1)))
names(x2) <- paste0("V", seq_len(ncol(x2)))
x1 %>% arrange(desc(V1))
x2 %>% arrange(desc(V1))
So I changed this line:
a = data.frame(Day=Sys.Date()+1:5,x=1:5)
to
a = data.frame(Day=Sys.Date()+1:5,y=1:5)
Why does this happen? When you run the code you posted above, you should actually get warning messages. In my version of R, I get the following:
Warning messages:
1: In merge.data.frame(x2, a, by = "Day", all = T) :
column names ‘x.x’, ‘x.y’ are duplicated in the result
2: In merge.data.frame(x2, a, by = "Day", all = T) :
column names ‘x.x’, ‘x.y’ are duplicated in the result
3: In merge.data.frame(x2, a, by = "Day", all = T) :
column names ‘x.x’, ‘x.y’, ‘x.x’, ‘x.y’ are duplicated in the result
4: In merge.data.frame(x2, a, by = "Day", all = T) :
column names ‘x.x’, ‘x.y’, ‘x.x’, ‘x.y’ are duplicated in the result
5: In merge.data.frame(x2, a, by = "Day", all = T) :
column names ‘x.x’, ‘x.y’, ‘x.x’, ‘x.y’, ‘x.x’, ‘x.y’ are duplicated in the result
6: In merge.data.frame(x2, a, by = "Day", all = T) :
column names ‘x.x’, ‘x.y’, ‘x.x’, ‘x.y’, ‘x.x’, ‘x.y’ are duplicated in the result
7: In merge.data.frame(x2, a, by = "Day", all = T) :
column names ‘x.x’, ‘x.y’, ‘x.x’, ‘x.y’, ‘x.x’, ‘x.y’, ‘x.x’, ‘x.y’ are duplicated in the result
8: In merge.data.frame(x2, a, by = "Day", all = T) :
column names ‘x.x’, ‘x.y’, ‘x.x’, ‘x.y’, ‘x.x’, ‘x.y’, ‘x.x’, ‘x.y’ are duplicated in the result
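So I think the reason the results of full_join and merge don't match in this case is that the column names in the two data frames you supplied are ambiguous: both have a column called x, so every pass through the loop creates more duplicated x.x and x.y names. Once you remove that ambiguity, the results match as expected, so I don't think this is a bug.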
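If you need to keep joining the same data frame repeatedly, one way to avoid the ambiguity is to give the incoming column a unique name on each pass. The following is only a minimal sketch under that assumption: the per-iteration names x_1, x_2, ... and the temporary a_i are hypothetical choices for illustration, not something from your code.

```r
library(dplyr)

a <- data.frame(Day = Sys.Date() + 1:5, x = 1:5)
b <- data.frame(Day = Sys.Date() - 1:5, x = 3 * (1:5))

x1 <- b
x2 <- b
for (i in 1:10) {
  # Hypothetical naming scheme: rename a's "x" to x_1, x_2, ... so that
  # no two columns in the result ever share a name.
  a_i <- a
  names(a_i)[names(a_i) == "x"] <- paste0("x_", i)

  x1 <- full_join(x1, a_i, by = "Day")
  x2 <- merge(x2, a_i, by = "Day", all = TRUE)
}

# No duplicated column names, so merge() has nothing to warn about.
anyDuplicated(names(x1))
anyDuplicated(names(x2))

# merge() sorts rows on the key by default, while full_join() keeps the
# rows of x1 first; sort by Day before comparing the two results.
all.equal(arrange(x1, Day), x2, check.attributes = FALSE)
```

With unique names there is nothing for either function to disambiguate, so the only remaining difference between the two results should be the row order.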