R, join within a range vectorised

Question

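I'm trying to join two datasets where a variable in one dataset (a position in the genome) falls within a range in the second (gene start/stop positions). However, positions are not unique; they are nested within an additional column (chromosome). The same goes for the gene start/stop positions. My goal is to link each position with its corresponding annotation and effect.

For example: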

library(sqldf)
set.seed(100)
a <- data.frame(
    annotation = sample(c("this", "that", "other"), 3, replace=TRUE),
    start = seq(1, 30, 10),
    chr = sample(1:3, 3, replace=TRUE)
  )
a$stop <- a$start + 10
b <- data.frame(
    chr = sample(1:3, 3, replace=TRUE),
    position = sample(1:15, 3, replace=TRUE),
    effect = sample(c("high", "low"), 3, replace=TRUE)
  )

An SQL inner join gets me part of the way there:

df<-sqldf("SELECT a.start, a.stop, a.annotation, b.effect, b.position
    FROM a
    INNER JOIN b ON (b.position >= a.start AND b.position <= a.stop);")

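But this doesn't account for positions being repeated across chromosomes. I'm having conceptual trouble wrapping this into a loop or an apply function.

I'm not wedded to SQL; it's just how I tackled a simpler problem before. I'm also not sure that making an additional index column is appropriate, since I have thousands of chromosome values.

My desired output would look like the following: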

    df$chr<-c("NA","2","2")
      start stop annotation effect position chr
1     1   11       this   high        3  NA
2     1   11       this   high       10  NA
3    11   21       this    low       14   2

Each position has been placed between the start and stop points on the correct chr, or given NA where no points on a chr match.

Answer 1

I think this is what you're after:

sqldf(
    "Select start, stop, annotation, effect, position,
    case when a.chr = b.chr then a.chr else NULL end as chr
    from b left join a
    on b.position between a.start and a.stop
    "
)
#   start stop annotation effect position chr
# 1     1   11       this   high        3  NA
# 2     1   11       this   high       10  NA
# 3    11   21       this    low       14   2
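
Note that the query above matches on position only and then blanks out chr where the chromosomes differ, which reproduces the desired output. If the chromosome should instead decide whether a position matches at all (one possible reading of "on the correct chr"), the equality can go into the join condition. A minimal sketch, assuming small hand-made tables `genes` and `variants` (hypothetical names and values, not the question's data):

```r
library(sqldf)

# Hypothetical example data: gene ranges and variant positions per chromosome
genes <- data.frame(
    chr = c(1, 1, 2),
    start = c(1, 11, 1),
    stop = c(10, 20, 10),
    annotation = c("this", "that", "other")
)
variants <- data.frame(
    chr = c(1, 2, 3),
    position = c(15, 5, 7),
    effect = c("high", "low", "high")
)

# LEFT JOIN keeps every variant; the gene columns come back as NA when
# no range on the same chromosome contains the position
res <- sqldf(
    "SELECT variants.chr, variants.position, variants.effect,
            genes.start, genes.stop, genes.annotation
     FROM variants LEFT JOIN genes
     ON variants.chr = genes.chr
        AND variants.position BETWEEN genes.start AND genes.stop
     ORDER BY variants.position"
)
```

Because chr is part of the ON clause, no extra index column is needed, and a variant on a chromosome without a covering gene simply gets NA for the gene columns.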

Answer 2

The development version of data.table introduces non-equi joins, allowing for:

library(data.table)
setDT(a) # converting to data.table in place
setDT(b)

b[a, on = .(position >= start, position <= stop), nomatch = 0,
  .(start, stop, annotation, effect, x.position, chr = ifelse(i.chr == x.chr, i.chr, NA))]
#   start stop annotation effect x.position chr
#1:     1   11       this   high          3  NA
#2:     1   11       this   high         10  NA
#3:    11   21       this    low         14   2
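
As with the SQL approach, the join above matches on position only and compares chr afterwards. If the chromosome should be part of the match itself, chr can be added to `on` as an ordinary equality alongside the two inequalities. A rough sketch, assuming a data.table version with non-equi joins and hypothetical tables `genes` and `variants` (made-up values, not the question's data):

```r
library(data.table)

# Hypothetical example data: gene ranges and variant positions per chromosome
genes <- data.table(
    chr = c(1L, 1L, 2L),
    start = c(1L, 11L, 1L),
    stop = c(10L, 20L, 10L),
    annotation = c("this", "that", "other")
)
variants <- data.table(
    chr = c(1L, 2L, 3L),
    position = c(15L, 5L, 7L),
    effect = c("high", "low", "high")
)

# For each row of variants, look up the gene on the same chr whose
# [start, stop] range contains the position. The default nomatch = NA
# keeps variants without a matching gene, with NA in the gene columns.
res <- genes[variants,
    on = .(chr, start <= position, stop >= position),
    .(chr = i.chr, position = i.position, effect,
      start = x.start, stop = x.stop, annotation)]
```

The `x.` and `i.` prefixes are used to pull the original values from `genes` and `variants` respectively, in the same way as `x.position` and `i.chr` in the join above.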

Tags: merge, r, sqldf, dplyr