r data.table ( <= 1.9.4) join behaviour
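After a while I have started using R and data.table again, but I still have problems with joins. I previously asked this question and got a satisfactory explanation, but I still don't really grasp the logic.
Let's consider a few examples: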
library("data.table")
X <- data.table(chiave=c("a", "a", "a", "b", "b"),valore1=1:5)
Y <- data.table(chiave=c("a", "b", "c", "d"),valore2=1:4)
X
chiave valore1
1: a 1
2: a 2
3: a 3
4: b 4
5: b 5
Y
chiave valore2
1: a 1
2: b 2
3: c 3
4: d 4
The join gives an error:
setkey(X,chiave)
X[Y]
# Error in vecseq(f__, len__, if (allow.cartesian || notjoin) NULL else as.integer(max(nrow(x), :
Join results in 7 rows; more than 5 = max(nrow(x),nrow(i)). Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including `j` and dropping `by` (by-without-by) so that j runs for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.
So:
X[Y,allow.cartesian=T]
chiave valore1 valore2
1: a 1 1
2: a 2 1
3: a 3 1
4: b 4 2
5: b 5 2
6: c NA 3
7: d NA 4
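The row count in the error message can be reproduced by hand. Here is a minimal sketch in base R, assuming the default nomatch=NA behaviour (the helper name expected_join_rows is hypothetical, not part of data.table): each row of i contributes one output row per matching row of X, or a single NA row if nothing matches.

```r
# Hypothetical helper (not a data.table function): number of rows X[Y]
# would produce, given only the key columns of X and of i.
expected_join_rows <- function(x_keys, i_keys) {
  # For every row of i, count how many rows of X carry the same key.
  n_match <- vapply(i_keys, function(k) sum(x_keys == k), integer(1))
  # An unmatched row of i still yields one row (filled with NA).
  sum(pmax(n_match, 1L))
}

x_keys <- c("a", "a", "a", "b", "b")
i_keys <- c("a", "b", "c", "d")

rows  <- expected_join_rows(x_keys, i_keys)      # 3 + 2 + 1 + 1 = 7
limit <- max(length(x_keys), length(i_keys))     # 5
rows > limit                                     # TRUE, so <= 1.9.4 stops
```

Under the same arithmetic, keys c("b", "c", "d") against this X give 4 rows (limit 5), and keys c("b", "b", "d") against an X with six "a" and two "b" give 5 rows (limit 8), which would be consistent with neither of those joins raising the error.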
Note that X has duplicate keys while i does not. If I change Y to:
Y <- data.table(chiave=c("b", "c", "d"),valore2=1:3)
Y
chiave valore2
1: b 1
2: c 2
3: d 3
The join completes with no error message and without needing allow.cartesian, but logically the situation is the same: X has duplicate keys and i does not.
X[Y]
chiave valore1 valore2
1: b 4 1
2: b 5 1
3: c NA 2
4: d NA 3
On the other hand:
X <- data.table(chiave=c("a", "a", "a", "a", "a", "a", "b", "b"),valore1=1:8)
Y <- data.table(chiave=c("b", "b", "d"),valore2=1:3)
X
chiave valore1
1: a 1
2: a 2
3: a 3
4: a 4
5: a 5
6: a 6
7: b 7
8: b 8
Y
chiave valore2
1: b 1
2: b 2
3: d 3
I have duplicate keys in both X and i, yet the join (and the Cartesian product) completes with no error message and without needing allow.cartesian:
setkey(X,chiave)
X[Y]
chiave valore1 valore2
1: b 7 1
2: b 8 1
3: b 7 2
4: b 8 2
5: d NA 3
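From my point of view, I need to be warned if and only if I have duplicate keys in both X and i (not merely when the result table has more rows than max(nrow(x),nrow(i))), and only in that case do I see the need for allow.cartesian (so not in my first two examples).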
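The condition proposed here can be written down as a small check. This is only a sketch of one possible reading (the same key value repeated on both sides), in base R, with a hypothetical helper name; it is not what data.table itself implements.

```r
# Hypothetical helper: TRUE when at least one key value is repeated
# in X and also repeated in i, i.e. a genuine many-to-many match.
has_dups_on_both_sides <- function(x_keys, i_keys) {
  dup_x <- unique(x_keys[duplicated(x_keys)])  # keys repeated in X
  dup_i <- unique(i_keys[duplicated(i_keys)])  # keys repeated in i
  length(intersect(dup_x, dup_i)) > 0
}

# The three examples from this post, keys only.
ex1 <- has_dups_on_both_sides(c("a", "a", "a", "b", "b"), c("a", "b", "c", "d"))
ex2 <- has_dups_on_both_sides(c("a", "a", "a", "b", "b"), c("b", "c", "d"))
ex3 <- has_dups_on_both_sides(c(rep("a", 6), "b", "b"),   c("b", "b", "d"))

c(ex1, ex2, ex3)
```

Only the third example satisfies this condition, whereas the row-count check in <= 1.9.4 fires only on the first.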
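Just to keep this answered: this behaviour of allow.cartesian has been fixed in the current development version, v1.9.5, and will soon be available on CRAN as v1.9.6. Odd-numbered versions are development releases, even-numbered ones are stable. From NEWS: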
allow.cartesian is ignored during joins when:
- i has no duplicates and mult="all". Closes #742. Thanks to @nigmastar for the report.
- assigning by reference, i.e., j has :=. Closes #800. Thanks to @matthieugomez for the report.
In both these cases (and during a not-join, which was already fixed in 1.9.4), allow.cartesian can be safely ignored.
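A quick way to check this on your own installation, as a sketch assuming data.table >= 1.9.6 is installed (the tables X2 and Y2 are made up for illustration and are not from the question):

```r
library(data.table)

# First example from the question: duplicates in X only.
X <- data.table(chiave = c("a", "a", "a", "b", "b"), valore1 = 1:5)
Y <- data.table(chiave = c("a", "b", "c", "d"), valore2 = 1:4)
setkey(X, chiave)

# i has no duplicates, so no allow.cartesian should be needed here.
res <- X[Y]
nrow(res)

# Illustrative many-to-many case: the same key repeated on both sides.
X2 <- data.table(chiave = rep("a", 4), valore1 = 1:4)
Y2 <- data.table(chiave = rep("a", 4), valore2 = 1:4)
setkey(X2, chiave)

# Without the flag this is expected to stop with the "Join results in ..."
# error; tryCatch keeps the script running either way.
attempt <- tryCatch(X2[Y2], error = function(e) conditionMessage(e))

# With the flag, every row of Y2 pairs with every matching row of X2.
res_cart <- X2[Y2, allow.cartesian = TRUE]
nrow(res_cart)
```

The first join should return the same 7 rows as the allow.cartesian=T call in the question, and the explicit Cartesian join returns 4 x 4 = 16 rows.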