R data.table (<= 1.9.4) join behaviour

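After some time away I've started using R and data.table again, but I still have doubts about joins. I previously asked this question and got a satisfactory explanation, but I still don't really understand the logic. Let's consider a few examples: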

library("data.table")
X <- data.table(chiave=c("a", "a", "a", "b", "b"),valore1=1:5)
Y <- data.table(chiave=c("a", "b", "c", "d"),valore2=1:4)
X
   chiave valore1
1:      a       1
2:      a       2
3:      a       3
4:      b       4
5:      b       5
 Y
   chiave valore2
1:      a       1
2:      b       2
3:      c       3
4:      d       4

The join raises an error:

 setkey(X,chiave)
 X[Y]
# Error in vecseq(f__, len__, if (allow.cartesian || notjoin) NULL else as.integer(max(nrow(x),  : 
  Join results in 7 rows; more than 5 = max(nrow(x),nrow(i)). Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including `j` and dropping `by` (by-without-by) so that j runs for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.

So:

 X[Y,allow.cartesian=T]
   chiave valore1 valore2
1:      a       1       1
2:      a       2       1
3:      a       3       1
4:      b       4       2
5:      b       5       2
6:      c      NA       3
7:      d      NA       4

Note that X has duplicate keys while i does not. If I change Y to:

 Y <- data.table(chiave=c("b", "c", "d"),valore2=1:3)
 Y
   chiave valore2
1:      b       1
2:      c       2
3:      d       3

The join completes with no error message and without needing allow.cartesian, yet logically the situation is the same: X has duplicate keys and i does not.

 X[Y]
   chiave valore1 valore2
1:      b       4       1
2:      b       5       1
3:      c      NA       2
4:      d      NA       3

On the other hand:

 X <- data.table(chiave=c("a", "a", "a", "a", "a", "a", "b", "b"),valore1=1:8)
 Y <- data.table(chiave=c("b", "b", "d"),valore2=1:3)
 X
   chiave valore1
1:      a       1
2:      a       2
3:      a       3
4:      a       4
5:      a       5
6:      a       6
7:      b       7
8:      b       8
 Y
   chiave valore2
1:      b       1
2:      b       2
3:      d       3

Here I have duplicate keys in both X and i, yet the join (and the Cartesian product) completes with no error message and without needing allow.cartesian:

 setkey(X,chiave)
 X[Y]
   chiave valore1 valore2
1:      b       7       1
2:      b       8       1
3:      b       7       2
4:      b       8       2
5:      d      NA       3
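The three outcomes above are consistent with the check quoted in the error message, which compares the number of result rows against max(nrow(x),nrow(i)). The following is a minimal sketch of that arithmetic in base R, not data.table's internal code; `join_rows` and `exceeds_limit` are hypothetical helper names, and the keys are passed as plain character vectors.

```r
# Rows produced by x[i] when unmatched rows of i are kept (filled with NA):
# each row of i yields one row per matching key in x, or a single NA row.
join_rows <- function(x_keys, i_keys) {
  per_i <- vapply(i_keys, function(k) max(sum(x_keys == k), 1L), integer(1))
  sum(per_i)
}

# The condition stated in the v1.9.4 error message (an assumption that this
# is the whole rule): result rows exceed max(nrow(x), nrow(i)).
exceeds_limit <- function(x_keys, i_keys) {
  join_rows(x_keys, i_keys) > max(length(x_keys), length(i_keys))
}

x1 <- c("a", "a", "a", "b", "b")
x3 <- c("a", "a", "a", "a", "a", "a", "b", "b")

ex1 <- exceeds_limit(x1, c("a", "b", "c", "d"))  # 7 rows vs limit 5
ex2 <- exceeds_limit(x1, c("b", "c", "d"))       # 4 rows vs limit 5
ex3 <- exceeds_limit(x3, c("b", "b", "d"))       # 5 rows vs limit 8
print(c(ex1, ex2, ex3))
```

Only the first example crosses the row-count threshold, which is why it alone triggered the error even though it contains no duplicates in i.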

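From my point of view, I need to be warned if and only if I have duplicate keys in both X and i (not merely because the result table has more rows than max(nrow(x),nrow(i))), and only in that case do I see a need for allow.cartesian (so not in my first two examples).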
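To make the proposed criterion concrete, here is a rough sketch of it in base R. `has_dup` and `dup_in_both` are hypothetical names used only for illustration; this expresses the rule suggested in the question, not anything data.table implements.

```r
# TRUE when a key vector contains at least one repeated value.
has_dup <- function(keys) anyDuplicated(keys) > 0L

# Proposed rule: warn only when both sides of the join have duplicate keys.
dup_in_both <- function(x_keys, i_keys) has_dup(x_keys) && has_dup(i_keys)

x1 <- c("a", "a", "a", "b", "b")
x3 <- c("a", "a", "a", "a", "a", "a", "b", "b")

r1 <- dup_in_both(x1, c("a", "b", "c", "d"))  # duplicates in X only
r2 <- dup_in_both(x1, c("b", "c", "d"))       # duplicates in X only
r3 <- dup_in_both(x3, c("b", "b", "d"))       # duplicates in both
print(c(r1, r2, r3))
```

Under this rule only the third example would be flagged, which is the opposite of what v1.9.4 does with these three inputs.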

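Just to keep this answered: this behaviour of allow.cartesian has been fixed in the current development version, v1.9.5, and will soon be available on CRAN as v1.9.6. Odd-numbered versions are development releases and even-numbered ones are stable. From NEWS: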

  1. allow.cartesian is ignored during joins when:

    • i has no duplicates and mult="all". Closes #742. Thanks to @nigmastar for the report.
    • assigning by reference, i.e., j has :=. Closes #800. Thanks to @matthieugomez for the report.

    In both these cases (and during a not-join which was already fixed in 1.9.4), allow.cartesian can be safely ignored.
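As an illustration of the first bullet, the first example from the question can be rerun without allow.cartesian. This sketch assumes an installed data.table of v1.9.6 or later; on v1.9.4 the same call stops with the error shown in the question.

```r
library(data.table)

# First example from the question: duplicate keys in X, none in Y (i).
X <- data.table(chiave = c("a", "a", "a", "b", "b"), valore1 = 1:5)
Y <- data.table(chiave = c("a", "b", "c", "d"), valore2 = 1:4)
setkey(X, chiave)

# Per the NEWS entry, i has no duplicates and mult defaults to "all",
# so allow.cartesian should not be needed here (assumes v1.9.6+).
res <- X[Y]
print(packageVersion("data.table"))
print(nrow(res))
```

The result has 7 rows, more than max(nrow(X), nrow(Y)) = 5, but no row of Y is matched against X more than once, so the join cannot blow up the way a true Cartesian product can.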