data.table: get index per group || two data.table

Question

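I have two data frames in data.table form. One holds grouped data, and I want to extract the indices of its values from the second data.table. Here is some sample data: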

snp_bygene<-data.table(V2=c("SNP1","SNP2","SNP3","SNP4","SNP5","SNP11","SNP12","SNP13","SNP14","SNP15"),
GENE=c( rep("GENE1",5),rep("GENE2",5) ),START=c(rep(100,5),rep(200,5)),END=c(rep(190,5),rep(290,5)) )

snp_data<-data.table(V2=c("SNP1","SNP2","SNP3","SNP4","SNP5","SNP11","SNP12","SNP13","SNP14","SNP15"),BP=c(101,102,105,110,125,201,202,205,210,225))

I would like to get the indices of V2 in snp_bygene that match V2 in snp_data. For each gene, I would like to get the SNP positions.

setkey(snp_data, V2)
snp_bygene[snp_data]

How should I proceed?

The final output should look like this:
finalindex_perGene<-list("GENE1"=c(1, 2, 3, 4, 5) , "GENE2" =c(6, 7, 8, 9, 10))

Edit 1: There is no GENE group in snp_data.

Answer 1

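We can do a non-equi join with on, matching on 'V2' and bounding 'BP' by the 'START' and 'END' columns, get the row index with .I, return it in a list together with the 'GENE' column, and then split that index by 'GENE'. (I is the default column name created because we did not specify a name; it can be made explicit with .(I = .I, GENE).)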

with(snp_bygene[snp_data, .(.I, GENE), on = .(V2, START <= BP, 
       END >= BP)], split(I, GENE))

-output

#$GENE1
#[1] 1 2 3 4 5

#$GENE2
#[1]  6  7  8  9 10
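
As a variation on the join above (a minimal sketch, not part of the original answer), the row number can be stored as an ordinary column before joining, which makes it explicit which table the index refers to. The column name idx is a hypothetical choice, and the sketch assumes the index wanted is the row position in snp_data and that your data.table version accepts nomatch = NULL.

```r
library(data.table)

snp_bygene <- data.table(
  V2    = c("SNP1", "SNP2", "SNP3", "SNP4", "SNP5",
            "SNP11", "SNP12", "SNP13", "SNP14", "SNP15"),
  GENE  = rep(c("GENE1", "GENE2"), each = 5),
  START = rep(c(100, 200), each = 5),
  END   = rep(c(190, 290), each = 5)
)
snp_data <- data.table(
  V2 = c("SNP1", "SNP2", "SNP3", "SNP4", "SNP5",
         "SNP11", "SNP12", "SNP13", "SNP14", "SNP15"),
  BP = c(101, 102, 105, 110, 125, 201, 202, 205, 210, 225)
)

# Store each SNP's row number in snp_data as a regular column.
# (idx is a hypothetical name, not from the original post.)
snp_data[, idx := .I]

# Equi join on V2 plus the non-equi conditions START <= BP <= END.
# nomatch = NULL drops snp_bygene rows that have no match in snp_data.
matched <- snp_data[snp_bygene,
                    .(GENE, idx),
                    on = .(V2, BP >= START, BP <= END),
                    nomatch = NULL]

# One integer vector of snp_data row numbers per gene.
finalindex_perGene <- split(matched$idx, matched$GENE)
```

Because idx is created before the join, it keeps pointing at the original snp_data rows regardless of how the join orders its result.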

Answer 2

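If BP is not involved when matching the two data tables (i.e., the match relies on V2 only), we can use chmatch to get the matching row indices, for example: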

> with(snp_bygene, split(chmatch(snp_bygene[, V2], snp_data[, V2]), GENE))
$GENE1
[1] 1 2 3 4 5

$GENE2
[1]  6  7  8  9 10

Otherwise, you may need a join that combines non-equi and equi conditions.
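
The chmatch approach can also be written as a self-contained script. This is a sketch under the assumption that SNPs missing from snp_data should simply be dropped; chmatch returns NA for values it cannot find, and the names pos and keep are hypothetical helpers introduced here.

```r
library(data.table)

snp_bygene <- data.table(
  V2   = c("SNP1", "SNP2", "SNP3", "SNP4", "SNP5",
           "SNP11", "SNP12", "SNP13", "SNP14", "SNP15"),
  GENE = rep(c("GENE1", "GENE2"), each = 5)
)
snp_data <- data.table(
  V2 = c("SNP1", "SNP2", "SNP3", "SNP4", "SNP5",
         "SNP11", "SNP12", "SNP13", "SNP14", "SNP15"),
  BP = c(101, 102, 105, 110, 125, 201, 202, 205, 210, 225)
)

# Position of each snp_bygene SNP within snp_data$V2 (NA when absent).
pos <- chmatch(snp_bygene$V2, snp_data$V2)

# Group the positions by gene, dropping any unmatched (NA) entries.
keep <- !is.na(pos)
finalindex_perGene <- split(pos[keep], snp_bygene$GENE[keep])
```

Note that setkey(snp_data, V2) physically sorts snp_data by V2, so positions computed after that call refer to the sorted row order rather than the order in which the table was created.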
