Calculating distance between two rows in a data.table

Question

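Problem summary: I am cleaning a fish telemetry dataset (i.e., spatial coordinates through time) using the data.table package (version 1.9.5) in R (version not given) on a Windows 7 PC. Some of the data points are wrong (e.g., the telemetry equipment picked up an echo). We can tell these points are wrong because the fish moved farther than is biologically possible, so they stand out as outliers. The actual dataset contains over 2,000,000 rows of data from 30 fish, hence the use of the data.table package.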

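I am removing points that are too far apart (i.e., where the distance traveled is greater than a maximum distance). However, I need to recalculate the distance traveled between points after removing a point, because sometimes 2-3 data points are mis-recorded in a cluster. Currently, I have a for loop that gets the job done, but it is probably far from optimal, and I know I am likely missing some of the powerful tools in the data.table package.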

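As a technical note, my spatial scale is small enough that Euclidean distance works, and my maximum distance criterion is biologically reasonable.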

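Where I've looked for help: I've looked through SO and found several useful answers, but none exactly match my problem. Specifically, all of the other answers only compare a single column of data across rows.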

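This answer compares two rows using data.table, but only looks at one variable.
This one looks promising and uses Reduce, but I could not figure out how to use Reduce with two columns.
This answer uses the indexing feature of data.table, but I could not figure out how to use it with a distance function.
Last, this answer demonstrates the roll feature of data.table. However, I could not figure out how to use two variables with this function either.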

Here is my MVCE:

library(data.table)
## Create dummy data.table
dt <- data.table(fish = 1,
                 time = 1:6,
                 easting = c(1, 2, 10, 11, 3, 4),
                 northing = c(1, 2, 10, 11, 3, 4))
dt[ , dist := 0]

maxDist = 5

## First pass of calculating distances 
for(index in 2:dim(dt)[1]){
    dt[ index,
       dist := as.numeric(dist(dt[c(index -1, index),
                list(easting, northing)]))]
}

## Loop through and remove points until all of the outliers have been
## removed for the data.table. 
while(any(dt[ , dist > maxDist])){
    dt <- copy(dt[ - dt[ , min(which(dist > maxDist))], ])
    ## Loops through and recalculates distance after removing outlier  
    for(index in 2:dim(dt)[1]){
        dt[ index,
           dist := as.numeric(dist(dt[c(index -1, index),
                    list(easting, northing)]))]
    }
}

Answer 1

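I'm a little confused about why you keep recalculating the distances (and needlessly copying the data) instead of just doing a single pass: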

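last = 1                  # index of the most recently accepted point
idx = rep(0, nrow(dt))    # 0 marks a dropped row; zero indices are ignored by dt[idx]
for (curr in 1:nrow(dt)) {
  # keep the point only if it lies within maxDist of the last accepted point
  if (dist(dt[c(curr, last), .(easting, northing)]) <= maxDist) {
    idx[curr] = curr
    last = curr
  }
}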

dt[idx]
#   fish time easting northing
#1:    1    1       1        1
#2:    1    2       2        2
#3:    1    5       3        3
#4:    1    6       4        4
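
Since the real data covers 30 fish, the same single pass could be run once per fish with `by = fish`. The following is only a minimal sketch under some assumptions: `keep_within` is a hypothetical helper name (not part of data.table), the rows are sorted by time within each fish, the first point of every fish is taken to be valid (as in the loop above), and the Euclidean distance is written out with `sqrt` instead of calling `dist()` on a two-row subset. The two-fish table is made-up dummy data that repeats the coordinates from the question.

```r
library(data.table)

# Hypothetical helper: returns TRUE for each point to keep.
# A point is kept when it lies within max_dist of the last kept point;
# the first point always passes because its distance to itself is 0.
keep_within <- function(x, y, max_dist) {
  keep <- logical(length(x))
  last <- 1L
  for (curr in seq_along(x)) {
    step <- sqrt((x[curr] - x[last])^2 + (y[curr] - y[last])^2)
    if (step <= max_dist) {
      keep[curr] <- TRUE
      last <- curr
    }
  }
  keep
}

# Dummy data: two fish sharing the coordinates from the question
dt <- data.table(fish = rep(1:2, each = 6),
                 time = rep(1:6, times = 2),
                 easting = rep(c(1, 2, 10, 11, 3, 4), times = 2),
                 northing = rep(c(1, 2, 10, 11, 3, 4), times = 2))
maxDist <- 5

# Sort by fish and time, then flag points separately for each fish
setorder(dt, fish, time)
dt[, keep := keep_within(easting, northing, maxDist), by = fish]

# Keep the flagged rows and drop the helper column
cleaned <- dt[(keep)]
cleaned[, keep := NULL]
```

Working on plain numeric vectors inside the helper avoids subsetting the data.table on every iteration, which is likely to matter at 2,000,000 rows, although the loop itself is still sequential because each decision depends on the last kept point.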
