Odd behaviour of data.table's update on non-equi self-join

Question

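While preparing an answer to the question dplyr or data.table to calculate time series aggregations in R, I noticed that I get different results depending on whether the table is updated in place or returned as a new object. I also get different results when I change the order of the columns in the non-equi join condition.

Currently, I have no explanation for this. It may be due to a major misunderstanding on my part or to a simple coding error.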

Please, note that this question is asking particularly for explanations of the observed behaviour of data.table joins. If you have alternative solutions to the underlying problem, please, feel free to post an answer to the original question.

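The original question and a working answer

The original question was how to count, for each patient, the number of hospitalizations in the 365 days before each hospitalization (including the hospitalization itself), using these data: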

library(data.table)   # version 1.10.4 (CRAN) or 1.10.5 (devel built 2017-08-19)
DT0 <- data.table(
  patient.id = c(1L, 2L, 1L, 1L, 2L, 2L, 2L),
  hospitalization.date = as.Date(c("2013/10/15", "2014/10/15", "2015/7/16", "2016/1/7", 
                                   "2015/12/20", "2015/12/25", "2016/2/10")))
setorder(DT0, patient.id, hospitalization.date)
DT0

   patient.id hospitalization.date
1:          1           2013-10-15
2:          1           2015-07-16
3:          1           2016-01-07
4:          2           2014-10-15
5:          2           2015-12-20
6:          2           2015-12-25
7:          2           2016-02-10

The code below gives the expected answer (extra helper columns are added here for clarity):

# add helper columns
DT0[, start.date := hospitalization.date - 365][
  , end.date := hospitalization.date][]
DT0

   patient.id hospitalization.date start.date   end.date
1:          1           2013-10-15 2012-10-15 2013-10-15
2:          1           2015-07-16 2014-07-16 2015-07-16
3:          1           2016-01-07 2015-01-07 2016-01-07
4:          2           2014-10-15 2013-10-15 2014-10-15
5:          2           2015-12-20 2014-12-20 2015-12-20
6:          2           2015-12-25 2014-12-25 2015-12-25
7:          2           2016-02-10 2015-02-10 2016-02-10

result <- DT0[DT0, on = c("patient.id", "hospitalization.date>=start.date", 
              "hospitalization.date<=end.date"), 
   .(hospitalizations.last.year = .N), by = .EACHI][]
result

   patient.id hospitalization.date hospitalization.date hospitalizations.last.year
1:          1           2012-10-15           2013-10-15                          1
2:          1           2014-07-16           2015-07-16                          1
3:          1           2015-01-07           2016-01-07                          2
4:          2           2013-10-15           2014-10-15                          1
5:          2           2014-12-20           2015-12-20                          1
6:          2           2014-12-25           2015-12-25                          2
7:          2           2015-02-10           2016-02-10                          3

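The result is as expected, except for the renamed and duplicated column names (left as is for comparison).

For patient.id == 2, the result in the last row is 3 because on 2016-02-10 the patient was hospitalized for the third time since 2015-02-10.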
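As a cross-check that does not rely on any join, the expected counts can be recomputed row by row. This is only a sketch in base R on a plain data.frame copy of the data; the helper name count_prior is hypothetical and not part of the original code.

```r
# Same data as DT0 above, as a plain data.frame (already sorted by patient and date)
df <- data.frame(
  patient.id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L),
  hospitalization.date = as.Date(c("2013-10-15", "2015-07-16", "2016-01-07",
                                   "2014-10-15", "2015-12-20", "2015-12-25",
                                   "2016-02-10")))

# count_prior is a hypothetical helper: for row k, count the rows of the same
# patient whose date lies in [date_k - 365, date_k]
count_prior <- function(k, d) {
  sum(d$patient.id == d$patient.id[k] &
        d$hospitalization.date >= d$hospitalization.date[k] - 365 &
        d$hospitalization.date <= d$hospitalization.date[k])
}

expected <- vapply(seq_len(nrow(df)), count_prior, integer(1), d = df)
expected
```

The vector expected should agree with the hospitalizations.last.year column of result.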

Update on join in place

result is a new data.table object, which takes up additional memory. I tried to update the original data.table object in place using:

# use copy of DT0 which can be safely modified
DT <- copy(DT0)

DT[DT, on = c("patient.id", "hospitalization.date>=start.date", 
            "hospitalization.date<=end.date"), 
   hospitalizations.last.year := .N, by = .EACHI]
DT

   patient.id hospitalization.date start.date   end.date hospitalizations.last.year
1:          1           2013-10-15 2012-10-15 2013-10-15                          1
2:          1           2015-07-16 2014-07-16 2015-07-16                          2
3:          1           2016-01-07 2015-01-07 2016-01-07                          2
4:          2           2014-10-15 2013-10-15 2014-10-15                          1
5:          2           2015-12-20 2014-12-20 2015-12-20                          3
6:          2           2015-12-25 2014-12-25 2015-12-25                          3
7:          2           2016-02-10 2015-02-10 2016-02-10                          3

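DT has now been updated in place, but rows 5 and 6 now show 3 hospitalizations each instead of 1 and 2, respectively. It seems that the total number of hospitalizations within the last period is now returned for each row.

Changing the order of columns in the condition

The order of the columns in the non-equi join conditions also matters, even in a self-join: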

result <- DT0[DT0, on = c("patient.id", "start.date<=hospitalization.date", 
                          "end.date>=hospitalization.date"), 
              .(hospitalizations.last.year = .N), by = .EACHI][]
result

My expectation was that "start.date<=hospitalization.date" is equivalent to "hospitalization.date>=start.date" (note that < and > are switched as well), but the result

   patient.id start.date   end.date hospitalizations.last.year
1:          1 2013-10-15 2013-10-15                          1
2:          1 2015-07-16 2015-07-16                          2
3:          1 2016-01-07 2016-01-07                          1
4:          2 2014-10-15 2014-10-15                          1
5:          2 2015-12-20 2015-12-20                          3
6:          2 2015-12-25 2015-12-25                          2
7:          2 2016-02-10 2016-02-10                          1

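is different. Now it seems to count the upcoming hospitalizations.

Interestingly, the update in place now returns the same result (except for some column names):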
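One reading that reproduces these numbers: in an on condition, the column named to the left of the operator is taken from x (the table being subset) and the column to the right from i. Swapping the names therefore changes which table supplies which column instead of merely mirroring the comparison. Under that reading, the swapped condition counts the rows of x whose window [start.date, end.date] contains the hospitalization date of the current i row. The base R sketch below checks this interpretation on a data.frame copy of the data; the helper name count_upcoming is hypothetical.

```r
# Same data as DT0 above, including the helper columns
df <- data.frame(
  patient.id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L),
  hospitalization.date = as.Date(c("2013-10-15", "2015-07-16", "2016-01-07",
                                   "2014-10-15", "2015-12-20", "2015-12-25",
                                   "2016-02-10")))
df$start.date <- df$hospitalization.date - 365
df$end.date   <- df$hospitalization.date

# count_upcoming is a hypothetical helper: for row k (playing the role of i),
# count the rows (playing the role of x) of the same patient whose window
# [start.date, end.date] contains the hospitalization date of row k
count_upcoming <- function(k, d) {
  sum(d$patient.id == d$patient.id[k] &
        d$start.date <= d$hospitalization.date[k] &
        d$end.date   >= d$hospitalization.date[k])
}

upcoming <- vapply(seq_len(nrow(df)), count_upcoming, integer(1), d = df)
upcoming
```

If this interpretation is right, upcoming matches the counts shown for the swapped condition.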

# use copy of DT0 which can be safely modified
DT <- copy(DT0)
DT[DT, on = c("patient.id", "start.date<=hospitalization.date", 
              "end.date>=hospitalization.date"), 
   hospitalizations.last.year := .N, by = .EACHI]
DT

   patient.id hospitalization.date start.date   end.date hospitalizations.last.year
1:          1           2013-10-15 2012-10-15 2013-10-15                          1
2:          1           2015-07-16 2014-07-16 2015-07-16                          2
3:          1           2016-01-07 2015-01-07 2016-01-07                          1
4:          2           2014-10-15 2013-10-15 2014-10-15                          1
5:          2           2015-12-20 2014-12-20 2015-12-20                          3
6:          2           2015-12-25 2014-12-25 2015-12-25                          2
7:          2           2016-02-10 2015-02-10 2016-02-10                          1

Related

There is a potentially related question which led to an issue reported on github.

There is also a question about the use of the x. prefix with non-equi joins.

Answer 1

Grouping with by=.EACHI means "by each i", not "by each x".

# for readability / my sanity
DT = copy(DT0)
setnames(DT, "hospitalization.date", "h.date")

z = DT[DT, on = .(patient.id, h.date >= start.date, h.date <= end.date), 
   .(x.h.date, patient.id, i.start.date, i.end.date, g = .GRP, .N)
, by=.EACHI][, utils:::tail.default(.SD, 6)]

      x.h.date patient.id i.start.date i.end.date g N
 1: 2013-10-15          1   2012-10-15 2013-10-15 1 1 * 
 2: 2015-07-16          1   2014-07-16 2015-07-16 2 1 
 3: 2015-07-16          1   2015-01-07 2016-01-07 3 2 *
 4: 2016-01-07          1   2015-01-07 2016-01-07 3 2 *
 5: 2014-10-15          2   2013-10-15 2014-10-15 4 1 *  
 6: 2015-12-20          2   2014-12-20 2015-12-20 5 1
 7: 2015-12-20          2   2014-12-25 2015-12-25 6 2  
 8: 2015-12-25          2   2014-12-25 2015-12-25 6 2 
 9: 2015-12-20          2   2015-02-10 2016-02-10 7 3 *
10: 2015-12-25          2   2015-02-10 2016-02-10 7 3 *
11: 2016-02-10          2   2015-02-10 2016-02-10 7 3 *

For patient 1, the groups are

.(start.date = 2012-10-15, end.date = 2013-10-15), count 1
.(start.date = 2014-07-16, end.date = 2015-07-16), count 1
.(start.date = 2015-01-07, end.date = 2016-01-07), count 2

It so happens that this join has seven groups and the original table has seven rows.

For the trickier issue, I'll borrow an example from my notes:

Beware multiple matches in an update join. When there are multiple matches, an update join will apparently only use the last one. Unfortunately, this is done silently. Try:
a = data.table(id = c(1L, 1L, 2L, 3L, NA_integer_), 
  t = c(1L, 2L, 1L, 2L, NA_integer_), x = 11:15)
b = data.table(id = 1:2, y = c(11L, 15L))
b[a, on=.(id), x := i.x, verbose = TRUE ][]

# Calculated ad hoc index in 0 secs
# Starting bmerge ...done in 0.02 secs
# Detected that j uses these columns: x,i.x 
# Assigning to 3 row subset of 2 rows
#    id  y  x
# 1:  1 11 12
# 2:  2 15 13
With verbose on, we see a helpful message about assignment “to 3 row subset of 2 rows.”

-- modified from "Quick R Tutorial", section "Updating in a join"

Unfortunately, in the OP's case, verbose=TRUE gives no such message.

DT[DT, on = .(patient.id, h.date >= start.date, h.date <= end.date), 
   n := .N, by = .EACHI, verbose=TRUE]
# Non-equi join operators detected ... 
#   forder took ... 0.01 secs
#   Generating group lengths ... done in 0 secs
#   Generating non-equi group ids ... done in 0 secs
#   Found 1 non-equi group(s) ...
# Starting bmerge ...done in 0.02 secs
# Detected that j uses these columns: <none> 
# lapply optimization is on, j unchanged as '.N'
# Making each group and running j (GForce FALSE) ... 
#   memcpy contiguous groups took 0.000s for 7 groups
#   eval(j) took 0.000s for 7 calls
# 0.01 secs

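However, we can see that the last row of each x group does contain the values the OP saw. I have marked these manually with asterisks above. Alternatively, you could mark them with z[, mrk := replace(rep(0, .N), .N, 1), by=x.h.date].

For reference, the update join here is...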
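The same "last match wins" pattern can be reproduced programmatically. The sketch below rebuilds a slimmed-down version of z from scratch and keeps, for every row of x, the count of the last i group that matched it. The names z2 and last.N are illustrative and not from the code above, and the sketch assumes a data.table version that supports non-equi joins.

```r
library(data.table)

# Same data as DT above, with hospitalization.date already named h.date
DT <- data.table(
  patient.id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L),
  h.date = as.Date(c("2013-10-15", "2015-07-16", "2016-01-07",
                     "2014-10-15", "2015-12-20", "2015-12-25",
                     "2016-02-10")))
DT[, `:=`(start.date = h.date - 365, end.date = h.date)]

# One output row per (i group, matching x row); N is the size of the i group
z2 <- DT[DT, on = .(patient.id, h.date >= start.date, h.date <= end.date),
         .(x.h.date, N = .N), by = .EACHI]

# The two join columns both come back named h.date; rename them by position
setnames(z2, 2:3, c("i.start.date", "i.end.date"))

# For each x row, keep the N of the last i group that matched it
last.N <- z2[, .(n = N[.N]), by = .(patient.id, x.h.date)]
last.N
```

The column n should carry the same values the OP saw after the update in place, which is consistent with each x row being overwritten by every group that matches it.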

DT[, n := 
  .SD[.SD, on = .(patient.id, h.date >= start.date, h.date <= end.date), .N, by=.EACHI]$N 
]

   patient.id hospitalization.date start.date   end.date     h.date n
1:          1           2013-10-15 2012-10-15 2013-10-15 2013-10-15 1
2:          1           2015-07-16 2014-07-16 2015-07-16 2015-07-16 1
3:          1           2016-01-07 2015-01-07 2016-01-07 2016-01-07 2
4:          2           2014-10-15 2013-10-15 2014-10-15 2014-10-15 1
5:          2           2015-12-20 2014-12-20 2015-12-20 2015-12-20 1
6:          2           2015-12-25 2014-12-25 2015-12-25 2015-12-25 2
7:          2           2016-02-10 2015-02-10 2016-02-10 2016-02-10 3

This is the correct/idiomatic way to handle this case, where a column is added to x by looking up each row of x in another table and computing a summary of the result:

x[, v := DT2[.SD, on=, j, by=.EACHI]$V1 ]
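To make the template concrete, here is a small sketch with two distinct tables. The tables windows and visits and all their columns are made up for illustration; only the shape of the call follows the template above.

```r
library(data.table)

# Hypothetical lookup table: one row per visit
visits <- data.table(
  id  = c(1L, 1L, 1L, 2L),
  day = as.Date(c("2020-01-01", "2020-01-10", "2020-03-01", "2020-01-05")))

# Hypothetical table to be updated: one row per time window
windows <- data.table(
  id   = c(1L, 1L, 2L),
  from = as.Date(c("2020-01-01", "2020-01-05", "2020-01-01")),
  to   = as.Date(c("2020-01-05", "2020-03-31", "2020-01-31")))

# Look up each row of windows in visits, count the matches per row (by = .EACHI),
# and assign the vector of counts as a new column of windows
windows[, n.visits := visits[.SD, on = .(id, day >= from, day <= to),
                             .N, by = .EACHI]$N]
windows
```

Because the inner join is grouped by each row of .SD, it returns exactly one count per row of windows, so the assignment lines up row by row and no row is overwritten by a later match.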
