R group by overlapping ranges
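I have a data frame whose rows contain ranges. I want to identify groups of ranges in which every range overlaps at least one other row in the group by more than 75%. The grouping should be added to the original data frame as an index variable.
Sample data: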
df <- data.frame(label = c("A", "B", "C", "D", "E", "F"),
start = c(16, 18, 37, 62, 15, 45),
stop = c(22, 24, 55, 66, 23, 55))
The resulting df should look like this:
label start stop ID
"A" 16 22 1
"B" 18 24 1
"C" 37 55 2
"D" 62 66 3
"E" 15 23 1
"F" 45 55 2
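First I tried dplyr options with mutate and lag, but the grouping depended on the row order and did not work in all cases. Next I tried a for loop over seq_along, but I could not get it to work. Hopefully one of you can...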
overlap <- function(A, B) {
shared <- pmax(0, min(A[2], B[2]) - max(A[1], B[1]))
max(shared / c(diff(A), diff(B)))
}
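To make the overlap measure concrete, here is a small sketch that applies `overlap()` to three of the sample ranges. The variable names `rA`, `rB` and `rE` are illustrative only and not part of the original answer. It shows that A and B do not reach 75% against each other, but each reaches it against E, which is why all three end up in one group.

```r
# Same helper as above: shared length divided by the length of each range,
# keeping the larger of the two fractions
overlap <- function(A, B) {
  shared <- pmax(0, min(A[2], B[2]) - max(A[1], B[1]))
  max(shared / c(diff(A), diff(B)))
}

rA <- c(16, 22)  # row A
rB <- c(18, 24)  # row B
rE <- c(15, 23)  # row E

overlap(rA, rE)  # 1: A lies entirely inside E
overlap(rB, rE)  # 5/6, above the threshold
overlap(rA, rB)  # 4/6, below the threshold
```

Because group membership only requires overlap with at least one other member, A and B are linked through E.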
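# All unordered pairs of row indices (a < b)
eg <- expand.grid(a = seq_len(nrow(df)), b = seq_len(nrow(df)))
eg <- eg[eg$a < eg$b,]
# Put both rows of each pair side by side (label1/start1/stop1, label2/start2/stop2)
together <- cbind(
  setNames(df[eg$a,], paste0(names(df), "1")),
  setNames(df[eg$b,], paste0(names(df), "2"))
)
# Vectorized version of overlap(): shared length relative to the shorter range
together <- within(together, {
  shared = pmax(0, pmin(stop1, stop2) - pmax(start1, start2))
  overlap = pmax(shared / (stop1 - start1), shared / (stop2 - start2))
})[, c("label1", "label2", "overlap")]
# Keep only the pairs that overlap by at least 75%
bigenough <- together[together$overlap >= 0.75,]
# Seed the groups from the qualifying pairs, then merge groups that share a label;
# labels without any qualifying pair end up as singleton groups
groups <- split(bigenough$label2, bigenough$label1)
for (ltr in df$label) {
  ind <- (ltr == names(groups)) | sapply(groups, `%in%`, x = ltr)
  groups <- c(
    setNames(list(unique(c(ltr, names(groups[ind]), unlist(groups[ind])))), ltr),
    groups[!ind]
  )
}
# Flatten the list of groups into a label-to-ID lookup table
groups <- data.frame(
  ID = rep(seq_along(groups), lengths(groups)),
  label = unlist(groups)
)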
Result:
merge(df, groups, by = "label")
# label start stop ID
# 1 A 16 22 2
# 2 B 18 24 2
# 3 C 37 55 1
# 4 D 62 66 3
# 5 E 15 23 2
# 6 F 45 55 1
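The grouping matches the requested one, but the ID numbers follow the internal order of the `groups` list rather than the order of first appearance. If the exact numbering from the question matters, one possible way to renumber is sketched below. The data frame `res` is a hand-typed copy of the merged result above and is only an illustrative name.

```r
# Merged result as printed above
res <- data.frame(label = c("A", "B", "C", "D", "E", "F"),
                  start = c(16, 18, 37, 62, 15, 45),
                  stop  = c(22, 24, 55, 66, 23, 55),
                  ID    = c(2, 2, 1, 3, 2, 1))

# Renumber the groups so that IDs increase in order of first appearance
res$ID <- match(res$ID, unique(res$ID))
res$ID
# [1] 1 1 2 3 1 2
```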
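You asked for a method without a for loop. Because each iteration (of the loop) has to work on the result of the previous iteration, lapply will not work for us. However, we can use Reduce: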
# groups <- split(...)
groups <- Reduce(function(grps, ltr) {
ind <- (ltr == names(grps)) | sapply(grps, `%in%`, x = ltr)
c(setNames(list(unique(c(ltr, names(grps[ind]), unlist(grps[ind])))), ltr),
grps[!ind])
}, df$label, init = groups)
# $F
# [1] "F" "C"
# $E
# [1] "E" "B" "A"
# $D
# [1] "D"
# groups <- data.frame(ID = ...)
# merge(df, groups, ...)
(followed by the final groups <- data.frame(..) call from above). This works just as well. The only catch is that Reduce itself uses for (https://github.com/wch/r-source/blob/d22ee2fc0dc8142b23eed9f46edf76ea9d3ca69a/src/library/base/R/funprog.R) :-)
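As a cross-check, the grouping can also be read as finding connected components in a graph whose edges are the pairs that overlap by at least 75%. The following base R sketch is not part of the answer above. It swaps in a different technique (minimum-label propagation over an adjacency matrix), and the names `shared`, `len`, `frac`, `adj` and `id` are my own. It still iterates, using `repeat`, and it assumes every range has a positive length.

```r
df <- data.frame(label = c("A", "B", "C", "D", "E", "F"),
                 start = c(16, 18, 37, 62, 15, 45),
                 stop  = c(22, 24, 55, 66, 23, 55))

# Shared length for every pair of ranges, as an n x n matrix
shared <- pmax(0, outer(df$stop, df$stop, pmin) - outer(df$start, df$start, pmax))
len <- df$stop - df$start

# Overlap fraction relative to either range, keeping the larger one
frac <- pmax(shared / len, t(shared / len))

# Adjacency matrix: TRUE where two ranges overlap by at least 75%
# (the diagonal is TRUE because each range fully overlaps itself)
adj <- frac >= 0.75

# Give every row the smallest label among its neighbours until nothing changes
id <- seq_len(nrow(df))
repeat {
  new_id <- apply(adj, 1, function(nb) min(id[nb]))
  if (all(new_id == id)) break
  id <- new_id
}

# Renumber the components 1, 2, 3, ... in order of first appearance
df$ID <- match(id, unique(id))
df$ID
# [1] 1 1 2 3 1 2
```

On the sample data this reproduces the IDs requested in the question. The matrices grow quadratically with the number of rows, so this sketch is only meant for small inputs.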