R: Find set of columns which contain most 1s in matrix of 0 and 1

Question

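I have a matrix of 1s and 0s where rows are individuals and columns are events. A 1 indicates that the event happened to the individual and a 0 that it did not.

I want to find which set of (in the example) 5 columns/events covers the most rows/individuals.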

Test data

#Make test data
set.seed(123)
d <- sapply(1:300, function(x) sample(c(0,1), 30, T, c(0.9,0.1)))
colnames(d) <- 1:300
rownames(d) <- 1:30

My attempt

My initial attempt was simply to take the set of 5 columns with the highest colMeans:

#Get top 5 columns with highest row coverage
col_set <- head(sort(colMeans(d), decreasing = T), 5)

#Have a look the set
col_set

>
      197       199        59        80        76 
0.2666667 0.2666667 0.2333333 0.2333333 0.2000000

#Check row coverage of the column set
sum(apply(d[,colnames(d) %in% names(col_set)], 1, sum) > 0) / 30 #top 5

>
[1] 0.7

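However, this set does not cover the most rows. I tested this by pseudo-randomly sampling 10,000 different sets of 5 columns and then finding the set with the highest coverage: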

#Get 5 random columns using colMeans as prob in sample
##Random sample 10.000 times
set.seed(123)
result <- lapply(1:10000, function(x){
  col_set2 <- sample(colMeans(d), 5, F, colMeans(d))
  cover <- sum(apply(d[,colnames(d) %in% names(col_set2)], 1, sum) > 0) / 30 #random 5
  list(set = col_set2, cover = cover)
})

##Have a look at the best set
result[which.max(sapply(result, function(x) x[["cover"]]))]

>
[[1]]
[[1]]$set
        59        169        262         68        197 
0.23333333 0.10000000 0.06666667 0.16666667 0.26666667 

[[1]]$cover
[1] 0.7666667
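
One likely reason the top 5 columns by colMeans fall short is overlap: a column with a high mean adds nothing if its 1s sit on rows that are already covered. A minimal sketch on a made-up toy matrix (the matrix and the two-argument coverage() helper are illustrative assumptions, not part of the data above):

```r
# Toy matrix (made up for illustration): 6 individuals, 3 events
toy <- cbind(A = c(1, 1, 1, 1, 0, 0),
             B = c(1, 1, 1, 0, 0, 0),
             C = c(0, 0, 0, 0, 1, 1))

# Share of rows with at least one 1 in the chosen columns
coverage <- function(m, cols) mean(rowSums(m[, cols, drop = FALSE]) > 0)

colMeans(toy)                        # A and B have the highest means
top2  <- coverage(toy, c("A", "B"))  # B only repeats rows that A already covers
mixed <- coverage(toy, c("A", "C"))  # C has the lowest mean but adds new rows
c(top2 = top2, mixed = mixed)
```

Here the pair with the two highest means covers 4 of 6 rows, while swapping in the lowest-mean column covers all 6.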

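The reason for supplying colMeans as prob to sample is that the columns with the highest coverage are the ones I am most interested in.

So, using pseudo-random sampling, I can find a set of columns with higher coverage than when just using the top 5 columns. However, since my actual data sets are larger than the example, I am looking for a more efficient and rational way of finding the set of columns with the highest coverage.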

EDIT

Out of interest, I decided to microbenchmark the 3 solutions provided:

#Defining G. Grothendieck's coverage function outside his solutions
coverage <- function(ix) sum(rowSums(d[, ix]) > 0) / 30

#G. Grothendieck top solution
solution1 <- function(d){
  cols <- tail(as.numeric(names(sort(colSums(d)))), 20)
  co <- combn(cols, 5)
  itop <- which.max(apply(co, 2, coverage))
  co[, itop]
}

#G. Grothendieck "Older solution"
solution2 <- function(d){
  require(lpSolve)
  ones <- rep(1, 300)
  res <- lp("max", colSums(d), t(ones), "<=", 5, all.bin = TRUE, num.bin.solns = 10)
  m <- matrix(res$solution[1:3000] == 1, 300)
  cols <- which(rowSums(m) > 0)
  co <- combn(cols, 5)
  itop <- which.max(apply(co, 2, coverage))
  co[, itop]
}

#user2554330 solution
bestCols <- function(d, n = 5) {
  result <- numeric(n)
  for (i in seq_len(n)) {
    result[i] <- which.max(colMeans(d))
    d <- d[d[,result[i]] != 1,, drop = FALSE]
  }
  result
}

#Benchmarking...
microbenchmark::microbenchmark(solution1 = solution1(d),
                               solution2 = solution2(d),
                               solution3 = bestCols(d), times = 10)

>
Unit: microseconds
      expr        min         lq       mean      median         uq       max neval
 solution1 390811.850 497155.887 549314.385 578686.3475 607291.286 651093.16    10
 solution2  55252.890  71492.781  84613.301  84811.7210  93916.544 117451.35    10
 solution3    425.922    517.843   3087.758    589.3145    641.551  25742.11    10

Answer 1

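This looks like a relatively hard optimization problem because of the way the columns interact. An approximate strategy would be to pick the column with the highest mean, then delete the rows that have a 1 in that column, and repeat. You won't necessarily find the best solution this way, but you should get a fairly good one.

For example,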

set.seed(123)
d <- sapply(1:300, function(x) sample(c(0,1), 30, T, c(0.9,0.1)))
colnames(d) <- 1:300
rownames(d) <- 1:30
bestCols <- function(d, n = 5) {
  result <- numeric(n)
  for (i in seq_len(n)) {
    result[i] <- which.max(colMeans(d))
    d <- d[d[,result[i]] != 1,, drop = FALSE]
  }
  cat("final dim is ", dim(d))
  result
}
col_set <- bestCols(d)
sum(apply(d[,colnames(d) %in% col_set], 1, sum) > 0) / 30 #top 5

This gives 90% coverage.
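
To make the "not necessarily the best solution" caveat concrete, here is a small sketch on a made-up toy matrix where the greedy rule is beaten by another pair of columns. The names toy, greedy_cols and the two-argument coverage() are assumptions introduced for this illustration; greedy_cols restates the same rule as bestCols() so the block runs on its own.

```r
# Toy matrix (made up for illustration): 6 rows, 3 columns
toy <- cbind(A = c(1, 1, 0, 1, 1, 0),
             B = c(1, 1, 1, 0, 0, 0),
             C = c(0, 0, 0, 1, 1, 1))

# Same greedy rule as bestCols(): take the column with the highest mean,
# drop the rows it covers, and repeat
greedy_cols <- function(m, n) {
  picked <- integer(n)
  for (i in seq_len(n)) {
    picked[i] <- which.max(colMeans(m))
    m <- m[m[, picked[i]] != 1, , drop = FALSE]
  }
  picked
}

# Share of rows with at least one 1 in the chosen columns
coverage <- function(m, cols) mean(rowSums(m[, cols, drop = FALSE]) > 0)

g <- greedy_cols(toy, 2)
greedy_cov <- coverage(toy, g)        # A is picked first, then only one new row can be added
best_cov   <- coverage(toy, c(2, 3))  # B and C together cover every row
c(greedy = greedy_cov, best = best_cov)
```

Greedy covers 5 of 6 rows here, whereas columns B and C cover all 6, so the first pick can lock the search out of the best set.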

Answer 2

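The following provides a heuristic for finding an approximate solution. Find the N columns with the most ones, say N = 20, and then use brute force to evaluate every subset of 5 columns out of those 20. The subset with the highest coverage is shown below; its coverage is 93.3%.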

coverage <- function(ix) sum(rowSums(d[, ix]) > 0) / 30

N <- 20
cols <- tail(as.numeric(names(sort(colSums(d)))), N)

co <- combn(cols, 5)
itop <- which.max(apply(co, 2, coverage))
co[, itop]
## [1]  90 123 197 199 286

coverage(co[, itop])
## [1] 0.9333333

Repeating this for N = 5, 10, 15 and 20, we get coverages of 83.3%, 86.7%, 90% and 93.3%. The higher N is, the better the coverage, but the lower N is, the shorter the run time.
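
The run-time side of that trade-off can be estimated up front, because combn(cols, 5) produces choose(N, 5) subsets and coverage is evaluated once per subset. A quick sketch (the variable names counts and all_cols are just for illustration):

```r
N <- c(5, 10, 15, 20)

# Number of 5-column subsets that have to be evaluated for each N
counts <- choose(N, 5)
data.frame(N = N, subsets = counts)

# For comparison: brute force over all 300 columns of the test data
all_cols <- choose(300, 5)
all_cols
```

N = 20 already means 15504 subsets, and all 300 columns would mean roughly 1.96e10, which is why the candidate columns are pre-filtered before brute force.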

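Older solution

We can approximate this problem with a knapsack problem that uses integer linear programming to choose the 5 columns with the largest numbers of ones.
We get the 10 best solutions to this approximate problem and collect all columns that appear in at least one of the 10 solutions. There are 14 such columns, and we then use brute force to find which subset of 5 of the 14 columns has the highest coverage.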

library(lpSolve)

ones <- rep(1, 300)
res <- lp("max", colSums(d), t(ones), "<=", 5, all.bin = TRUE, num.bin.solns = 10)

coverage <- function(ix) sum(rowSums(d[, ix]) > 0) / 30

# each column of m is logical 300-vector defining possible soln
m <- matrix(res$solution[1:3000] == 1, 300)

# cols is the set of columns which are in any of the 10 solutions
cols <- which(rowSums(m) > 0)
length(cols)
## [1] 14

# use brute force to find the 5 best columns among cols
co <- combn(cols, 5)
itop <- which.max(apply(co, 2, coverage))
co[, itop]
## [1]  90 123 197 199 286
coverage(co[, itop])
## [1] 0.9333333
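
The knapsack objective counts ones rather than covered rows, which is why it is only an approximation. For reference, coverage itself can also be written as a binary program by adding one indicator variable per row. The sketch below is not the method used in this answer; it is a standard maximum-coverage formulation, shown on a made-up toy matrix, and whether it solves quickly on larger data is untested here. The names toy, k, obj, con and picked are assumptions for the illustration.

```r
library(lpSolve)

# Toy matrix (made up for illustration): 6 rows, 3 columns
toy <- cbind(A = c(1, 1, 0, 1, 1, 0),
             B = c(1, 1, 1, 0, 0, 0),
             C = c(0, 0, 0, 1, 1, 1))
k <- 2           # number of columns to pick
p <- ncol(toy)   # one binary variable x_j per column (1 = column selected)
m <- nrow(toy)   # one binary variable y_i per row (1 = row counted as covered)

obj <- c(rep(0, p), rep(1, m))         # maximise the number of covered rows
con <- rbind(c(rep(1, p), rep(0, m)),  # sum(x) <= k
             cbind(-toy, diag(m)))     # y_i <= sum_j toy[i, j] * x_j
dir <- rep("<=", m + 1)
rhs <- c(k, rep(0, m))

res <- lp("max", obj, con, dir, rhs, all.bin = TRUE)
picked <- which(res$solution[seq_len(p)] > 0.5)
picked       # indices of the selected columns
res$objval   # number of rows covered by them
```

A row variable y_i can only be 1 if at least one selected column has a 1 in that row, so the objective value is the number of covered rows rather than the number of ones.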

Answer 3

You can test whether there is a better column and exchange it with one currently in the selection.

n <- 5 #Number of columns / events
i <- rep(1, n)
for(k in 1:10) { #How many times to iterate
  tt <- i
  for(j in seq_along(i)) {
    x <- +(rowSums(d[,i[-j]]) > 0)
    i[j] <- which.max(colSums(x == 0 & d == 1))
  }
  if(identical(tt, i)) break
}
sort(i)
#[1]  90 123 197 199 286
mean(rowSums(d[,i]) > 0)
#[1] 0.9333333
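
The same exchange step can be wrapped in a function so that it is easy to rerun from different starting sets. This is a sketch under assumptions: swap_refine, its arguments and the toy matrix are hypothetical names introduced here, and the loop is a restatement of the code above rather than a different algorithm.

```r
# Hypothetical helper: refine a starting set of column indices by
# repeatedly replacing one column at a time with the best alternative
swap_refine <- function(d, start, max_iter = 10) {
  i <- start
  for (k in seq_len(max_iter)) {
    before <- i
    for (j in seq_along(i)) {
      # Rows already covered by the other selected columns
      covered <- rowSums(d[, i[-j], drop = FALSE]) > 0
      # For every column, count its 1s on rows that are not yet covered
      gain <- colSums(d[!covered, , drop = FALSE])
      i[j] <- which.max(gain)
    }
    if (identical(before, i)) break  # no swap changed the set: stop
  }
  i
}

# Toy matrix (made up for illustration): 6 rows, 3 columns
toy <- cbind(A = c(1, 1, 0, 1, 1, 0),
             B = c(1, 1, 1, 0, 0, 0),
             C = c(0, 0, 0, 1, 1, 1))

res_toy <- swap_refine(toy, start = c(1, 1))
sort(res_toy)
mean(rowSums(toy[, res_toy, drop = FALSE]) > 0)
```

Starting from column 1 twice, the swaps move to columns 2 and 3, which together cover every row of the toy matrix.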

Since the initial condition influences the result, you can use random starts.

n <- 5 #Number of columns / events
x <- apply(d, 2, function(x) colSums(x == 0 & d == 1))
diag(x) <- -1
idx <- which(!apply(x==0, 1, any))
x <- apply(d, 2, function(x) colSums(x != d))
diag(x) <- -1
x[upper.tri(x)] <- -1
idx <- unname(c(idx, which(apply(x==0, 1, any))))
res <- sample(idx, n)
for(l in 1:100) {
  i <- sample(idx, n)
  for(k in 1:10) { #How many times to iterate
    tt <- i
    for(j in seq_along(i)) {
      x <- +(rowSums(d[,i[-j]]) > 0)
      i[j] <- which.max(colSums(x == 0 & d == 1))
    }
    if(identical(tt, i)) break
  }
  if(sum(rowSums(d[,i]) > 0) > sum(rowSums(d[,res]) > 0)) res  <- i
}
sort(res)
#[1]  90 123 197 199 286
mean(rowSums(d[,res]) > 0)
#[1] 0.9333333
