Return factor associated with a numeric range defined in two columns

Question

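Using a database of numeric ranges defined by two columns, start and end, I am trying to look up the factor code associated with each numeric value in a separate vector, identityCodes.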

database <- data.frame(start = seq(1, 150000000, 1000), 
                       end = seq(1000, 150000000, 1000),
                       code = paste0(sample(LETTERS, 15000, replace = TRUE), 
                                      sample(LETTERS, 15000, replace = TRUE)))

identityCodes <- sample(1:15000000, 1000)

I came up with a way to find the corresponding codes using a for loop and subsetting:

fun <- function (x, y) {
  z <- rep(NA, length(x))
  for (i in 1:length(x)){
    z[i] <- as.character(y[y["start"] <= x[i] & y["end"] >= x[i], "code"])
  }
  return(z)
}

a <- fun(identityCodes, database)

But this approach is slow, especially when I scale it up:

system.time(fun(identityCodes, database))
user   system elapsed 
15.36    0.00   15.50

How can I identify the code associated with each of the identityCodes faster? Is there a better way to approach this than a for loop with subsetting?

Answer 1

This is about 45% faster on my machine:

result = lapply(identityCodes, function(x) {
  data.frame(identityCode=x, 
             code=database[database$start <= x & database$end >= x, "code"])
})

result = do.call(rbind, result)

Here is a sample of the output:

  identityCode code
1      6836845   OK
2     14100352   RB
3      2313115   NK
4      8440671   XN
5     11349271   TI
6     14467193   VL
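
To make the lookup condition concrete, here is a minimal sketch on a three-row toy table. The names toy, ids and codes and their values are made up for illustration and are not part of the original data. It applies the same start <= x & end >= x test, but collects the results with vapply instead of building one data.frame per value:

```r
# Hypothetical toy lookup table: three contiguous ranges
toy <- data.frame(start = c(1, 1001, 2001),
                  end   = c(1000, 2000, 3000),
                  code  = c("AA", "BB", "CC"),
                  stringsAsFactors = FALSE)

# Hypothetical values to look up
ids <- c(1500, 3000, 1)

# For each value, keep the code of the row whose range contains it
codes <- vapply(ids,
                function(x) toy$code[toy$start <= x & toy$end >= x],
                character(1))
codes
# [1] "BB" "CC" "AA"
```

vapply with character(1) raises an error if a value falls in no range or in more than one, which may or may not be the behaviour you want.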

Answer 2

Here is my attempt using data.table. It is very fast, although I am sure I am not using it efficiently.

The function from the question:

# method 1
system.time(result1 <- fun(identityCodes, database))
 user  system elapsed 
 8.99    0.00    8.98

Using data.table:

# method 2
require(data.table)

# x: a data.frame with columns start, end, code
# y: a vector with lookup codes
dt_comb <- function(x, y) {
  # convert x to a data.table (setDT converts by reference, without copying)
  # and set 'start' and 'end' as keys
  DT <- setDT(x)
  setkey(DT, start, end)

  # create a lookup data.table where start and end are the identityCodes
  DT2 <- data.table(start=y, end=y)

  # overlap join where DT2 start & end are within DT start and end
  res <- foverlaps(DT2, DT[, .(start, end)], type="within")

  # store i as row number and key (for sorting later)
  res[, i:=seq_len(nrow(res))]
  setkey(res, i)

  # merge the joined table to the original to get codes
  final <- merge(res, DT, by=c("start", "end"))[order(i), .(code)]

  # export as character the codes
  as.character(final[[1]])
}

system.time(result2 <- dt_comb(x=database, y=identityCodes))
 user  system elapsed 
 0.08    0.00    0.08

identical(result1, result2)
[1] TRUE

Edit: trimmed a few lines from the function.
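
For comparison, here is a base R sketch that uses findInterval instead of an overlap join. This is a different technique from the one in this answer, and it rests on an assumption: the ranges must be sorted by start and must not overlap, as they are in the example database. The data below is a scaled-down stand-in for the question's data, and lookup_codes is a hypothetical helper name.

```r
# Scaled-down stand-in for the question's data (15 ranges of width 1000)
set.seed(1)
database <- data.frame(start = seq(1, 15000, 1000),
                       end   = seq(1000, 15000, 1000),
                       code  = paste0(sample(LETTERS, 15, replace = TRUE),
                                      sample(LETTERS, 15, replace = TRUE)),
                       stringsAsFactors = FALSE)
identityCodes <- sample(1:15000, 100)

# Hypothetical helper. Assumes y is sorted by start and ranges do not overlap.
lookup_codes <- function(x, y) {
  # index of the last range whose start is <= x (0 if x is below every start)
  idx <- findInterval(x, y$start)
  idx[idx == 0] <- NA
  # a hit also requires x to be at or below the end of that range
  hit <- !is.na(idx) & x <= y$end[idx]
  ifelse(hit, as.character(y$code[idx]), NA_character_)
}

result3 <- lookup_codes(identityCodes, database)

# Cross-check against the row-by-row subsetting used in the question
check <- vapply(identityCodes,
                function(v) database$code[database$start <= v & database$end >= v],
                character(1))
identical(result3, check)
```

Values that fall outside every range come back as NA rather than raising an error. If the ranges can overlap or are unsorted, this shortcut does not apply and an overlap join such as foverlaps is the safer choice.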
