Return factor associated with a numeric range defined in two columns
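Using a database of numeric ranges defined by two columns, start and end, I am trying to look up the factor code associated with each numeric value held in a separate vector, identityCodes.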
# 15,000 consecutive ranges of width 1,000, covering 1 to 15,000,000
database <- data.frame(start = seq(1, 15000000, 1000),
                       end = seq(1000, 15000000, 1000),
                       code = paste0(sample(LETTERS, 15000, replace = TRUE),
                                     sample(LETTERS, 15000, replace = TRUE)))
identityCodes <- sample(1:15000000, 1000)
I came up with a way to find the corresponding codes using a for loop and subsetting:
fun <- function(x, y) {
  z <- rep(NA, length(x))
  for (i in seq_along(x)) {
    # keep the code of the row whose [start, end] range contains x[i]
    z[i] <- as.character(y[y["start"] <= x[i] & y["end"] >= x[i], "code"])
  }
  return(z)
}
a <- fun(identityCodes, database)
But this method is slow, especially when I try to scale it up:
system.time(fun(identityCodes, database))
   user  system elapsed
  15.36    0.00   15.50
How can I identify the code associated with each of the identityCodes more quickly? Is there a better way to solve this than a for loop and subsetting?
This is about 45% faster on my machine:
result = lapply(identityCodes, function(x) {
  data.frame(identityCode = x,
             code = database[database$start <= x & database$end >= x, "code"])
})
result = do.call(rbind, result)
Here is a sample of the output:
  identityCode code
1      6836845   OK
2     14100352   RB
3      2313115   NK
4      8440671   XN
5     11349271   TI
6     14467193   VL
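If the ranges are sorted by start and do not overlap, as they are in the example data, a fully vectorised lookup with base R's findInterval may avoid the per-element subsetting altogether. The following is only a minimal sketch under that assumption; the names ranges, values, idx, in_range and codes are illustrative and do not come from the question.

```r
# Small stand-in for the question's data: three sorted, non-overlapping ranges
# (assumption: real data is also sorted by 'start' and free of overlaps)
ranges <- data.frame(start = c(1, 1001, 2001),
                     end   = c(1000, 2000, 3000),
                     code  = c("AA", "BB", "CC"),
                     stringsAsFactors = FALSE)
values <- c(1500, 1, 3000, 2001, 3500)

# For each value, index of the last range whose start is <= value (0 if none)
idx <- findInterval(values, ranges$start)

# A value only matches if it also lies at or below the end of that range;
# pmax() keeps the index valid when idx is 0
in_range <- idx > 0 & values <= ranges$end[pmax(idx, 1)]
codes <- ifelse(in_range, ranges$code[pmax(idx, 1)], NA_character_)
codes
```

Values that fall outside every range come back as NA. If the ranges could overlap or arrive unsorted, this shortcut would not be safe, and an interval join would be the more appropriate tool.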
Here is my attempt using data.table. It is very fast, although I am sure I am not using it as efficiently as I could.
The given function:
# method 1
system.time(result1 <- fun(identityCodes, database))
   user  system elapsed
   8.99    0.00    8.98
Using data.table:
# method 2
library(data.table)
# x: a data.frame with columns start, end, code
# y: a vector with lookup codes
dt_comb <- function(x, y) {
  # convert x to a data.table (by reference, so the caller's object is
  # converted too) and set 'start' and 'end' as keys
  DT <- setDT(x)
  setkey(DT, start, end)
  # create a lookup data.table where start and end are the identityCodes
  DT2 <- data.table(start = y, end = y)
  # overlap join where DT2 start & end are within DT start and end
  res <- foverlaps(DT2, DT[, .(start, end)], type = "within")
  # store i as row number and key (for sorting later)
  res[, i := seq_len(nrow(res))]
  setkey(res, i)
  # merge the joined table to the original to get codes
  final <- merge(res, DT, by = c("start", "end"))[order(i), .(code)]
  # export as character the codes
  as.character(final[[1]])
}
system.time(result2 <- dt_comb(x = database, y = identityCodes))
   user  system elapsed
   0.08    0.00    0.08
identical(result1, result2)
[1] TRUE
Edit: trimmed a few lines from the function.
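For comparison, reasonably recent versions of data.table also support non-equi joins through the on argument, which may express the same range lookup in a single step. This is a hedged sketch rather than a tuned solution; the toy tables ranges and lookup and the column value are made-up names for illustration only.

```r
library(data.table)

# Toy stand-in for the database: three non-overlapping ranges
ranges <- data.table(start = c(1, 1001, 2001),
                     end   = c(1000, 2000, 3000),
                     code  = c("AA", "BB", "CC"))
# Values to look up (hypothetical column name 'value')
lookup <- data.table(value = c(1500, 1, 3000, 2001))

# Non-equi join: for each row of 'lookup', find the range with
# start <= value and end >= value, and return its code.
# Results follow the row order of 'lookup'; unmatched values give NA.
codes <- ranges[lookup, on = .(start <= value, end >= value), code]
codes
```

With overlapping ranges a single value could match several rows, so the result might be longer than the lookup vector. With non-overlapping ranges like those in the question, each value should yield exactly one code.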