How to optimize the intersect of rows and columns in a matrix?

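In a matrix, e.g. M1, the rows are countries and the columns are years. The countries do not have observations for the same years. I want to find the "best" intersect of years that gives me the most countries. The minimum number of years and the minimum number of countries will be predefined. Which countries end up in the result does not matter, and the years do not have to be consecutive.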

> M1
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15]
 [1,]   NA   NA   NA 2004   NA 2006   NA 2008 2009    NA  2011  2012    NA    NA    NA
 [2,]   NA 2002   NA 2004   NA   NA 2007   NA   NA  2010  2011    NA  2013  2014    NA
 [3,]   NA   NA   NA 2004 2005 2006 2007 2008 2009    NA    NA  2012  2013    NA  2015
 [4,]   NA 2002   NA 2004 2005 2006 2007 2008   NA  2010  2011    NA  2013    NA    NA
 [5,] 2001   NA   NA   NA 2005 2006 2007 2008   NA  2010    NA  2012  2013  2014    NA
 [6,] 2001   NA 2003 2004 2005 2006 2007 2008 2009  2010  2011  2012    NA  2014    NA
 [7,] 2001 2002   NA   NA 2005   NA 2007   NA 2009    NA  2011    NA    NA  2014  2015
 [8,] 2001 2002   NA 2004 2005 2006   NA   NA   NA  2010    NA    NA  2013    NA  2015
 [9,]   NA 2002   NA 2004 2005   NA 2007   NA   NA  2010  2011    NA    NA    NA    NA
[10,] 2001 2002   NA 2004   NA   NA   NA   NA   NA  2010    NA  2012    NA  2014  2015
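To make the goal concrete, here is a minimal sketch on a made-up 3 x 3 matrix (not M1). The helper `complete_rows` is a hypothetical name introduced only for illustration; it returns the rows (countries) that have no NA in a chosen set of columns (years).

```r
# Toy data: 3 countries (rows) x 3 years (columns), NA = no observation
M <- rbind(c(2001, NA,   2003),
           c(2001, 2002, 2003),
           c(NA,   2002, 2003))

# Hypothetical helper: indices of rows that are complete in the given columns
complete_rows <- function(M, cols) {
  which(rowSums(is.na(M[, cols, drop = FALSE])) == 0)
}

complete_rows(M, c(1, 3))  # rows 1 and 2
complete_rows(M, c(2, 3))  # rows 2 and 3
complete_rows(M, 1:3)      # row 2 only
```

Adding a year can only shrink the set of complete countries, which is the trade-off the question is about.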

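Because there is no obvious intersect, a single Reduce(intersect, ...) attempt does not work, so I repeat it while successively excluding one country at a time, down to the defined threshold min.row. The results are then filtered for at least min.col years. I wrote this function: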

findBestIntersect <- function(M, min.row=5, min.col=3) {
  ## min.row: minimum number of rows (countries) to analyze
  ## min.col: minimum number of complete columns (years)
  # put matrices with row combn into list (HUGE!)
  L1 <- lapply(min.row:(nrow(M) - 1), function(x)
    combn(nrow(M), x, function(i) M[i, ], simplify=FALSE))
  # select lists w/ def. number of complete columns
  slc <- sapply(L1, function(y)  # numbers of lists
    which(sapply(y, function(x)
      sum(!(apply(x, 2, function(i) any(is.na(i))))))
      >= min.col))
  # list selected lists
  L2 <- Map(function(x, i)
    x[i], L1[lengths(slc) > 0], slc[lengths(slc) > 0])
  # find intersects
  L3 <- rapply(L2, function(l)
    as.integer(na.omit(Reduce(intersect, as.list(as.data.frame(t(l)))))),
    how="list")
  return(unique(unlist(L3, recursive=FALSE)))
}

This gives me the desired result for M1 quickly.

> system.time(best.yrs.1 <- findBestIntersect(M1))
   user  system elapsed 
   0.06    0.00    0.07 

> best.yrs.1
[[1]]
[1] 2002 2004 2010
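As a small illustration of why one Reduce(intersect, ...) over all rows tends to come back empty, and why dropping a row helps, consider this made-up 4 x 4 matrix (an assumption for the example, not the data above):

```r
# Toy data: every year is missing for exactly one country
M <- rbind(c(2001, NA,   2003, 2004),
           c(2001, 2002, NA,   2004),
           c(NA,   2002, 2003, 2004),
           c(2001, 2002, 2003, NA))

# One vector of observed years per row
rows <- lapply(seq_len(nrow(M)), function(i) as.integer(na.omit(M[i, ])))

all_rows   <- Reduce(intersect, rows)       # integer(0): no year is shared by all
three_rows <- Reduce(intersect, rows[1:3])  # 2004: shared once row 4 is dropped
```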

But the performance for M2 is only barely acceptable (RAM usage of about 1.1 GB),

> system.time(best.yrs.2 <- findBestIntersect(M2))
   user  system elapsed 
  79.90    0.39   82.76 
> head(best.yrs.2, 3)
[[1]]
[1] 2002 2009 2015

[[2]]
[1] 2002 2014 2015

[[3]]
[1] 2003 2009 2010

and you do not want to try this with M3, which resembles my real matrix (it blows up 32 GB of RAM):

# best.yrs.3 <- findBestIntersect(M3)

The biggest flaw of the function is probably that L1 becomes too large very quickly.
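The growth of L1 can be estimated without running the function: it holds one sub-matrix per row subset of size min.row up to nrow(M) - 1. The helper name `n_subsets` below is hypothetical and only counts those subsets.

```r
# Number of sub-matrices stored in L1 for a matrix with n rows
n_subsets <- function(n, min.row = 5) sum(choose(n, min.row:(n - 1)))

n_subsets(10)  # M1: 637
n_subsets(20)  # M2: 1042379
n_subsets(31)  # M3: 2147447190
```

For M3 that is more than two billion sub-matrices, each with up to 30 x 48 cells, so the list cannot fit in memory regardless of how the later steps are written.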

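So my question is: is there a better approach that also works for M3? A "bonus" would be to maximize both countries and years. If possible, I would like to do this without additional packages.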

Data

set.seed(42)
tf <- matrix(sample(c(TRUE, FALSE), 150, replace=TRUE), 10)
M1 <- t(replicate(10, 2001:2015, simplify=TRUE))
M1[tf] <- NA

tf <- matrix(sample(c(TRUE, FALSE), 300, replace=TRUE), 20)
M2 <- t(replicate(20, 2001:2015, simplify=TRUE))
M2[tf] <- NA

tf <- matrix(sample(c(TRUE, FALSE), 1488, replace=TRUE), 31)
M3 <- t(replicate(31, 1969:2016, simplify=TRUE))
M3[tf] <- NA

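Since the combinations use a lot of memory without producing any result (at least on my computer it threw an error), perhaps clustering the data can provide a solution.

Data (with set.seed(42)):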

> M1
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15]
 [1,] 2001   NA 2003 2004   NA   NA 2007   NA 2009  2010  2011  2012    NA  2014    NA
 [2,] 2001 2002   NA 2004   NA   NA 2007   NA   NA    NA    NA  2012    NA  2014    NA
 [3,]   NA 2002 2003   NA   NA   NA 2007   NA   NA    NA    NA  2012  2013  2014    NA
 [4,] 2001   NA 2003 2004 2005 2006 2007   NA 2009  2010    NA    NA  2013  2014  2015
 [5,] 2001   NA   NA   NA   NA   NA 2007   NA 2009  2010  2011  2012  2013  2014  2015
 [6,] 2001 2002 2003 2004 2005 2006   NA 2008 2009  2010  2011  2012    NA  2014  2015
 [7,] 2001 2002   NA   NA 2005 2006   NA   NA   NA    NA  2011    NA  2013  2014    NA
 [8,]   NA   NA 2003   NA 2005   NA 2007   NA   NA  2010  2011    NA  2013    NA  2015
 [9,] 2001   NA   NA 2004 2005   NA 2007 2008   NA  2010  2011  2012    NA  2014    NA
[10,] 2001 2002 2003 2004 2005 2006   NA   NA   NA  2010    NA  2012  2013  2014    NA

I used the klaR library for its kmodes function, which clusters categorical variables (similar to kmeans).

Function:

  library(klaR)
  library(matrixStats)

opt <- function(data,iter) {
        my_list <- list()   # It will be assigned by the cluster outputs
        t_M <- t(!is.na(data)) # Transforming the data into TRUE/FALSE in other words 1/0   
        result <- lapply(1:iter, function(rand) { # This is for assigning new seeds.
                    set.seed(rand^2)
                    for(i in 2:nrow(t_M)-1) { # Loop for the  Number of the clusters 

                        # cluster function
                        cluster.results <-kmodes(t_M, i, iter.max = 100, weighted = FALSE , fast = TRUE) 


                    # Binding the data and clusters output. And adding index for finding the location                   
                    df <- as.data.frame(cbind(Index=1:nrow(t_M),Cluster=cluster.results$cluster,t_M)) 

                    # Below is the calculation step. It multiplies the columns. The idea is here if 
                    # there exist an intersection, the product must be 1. Otherwise it is 0.
                    calc <-  sapply(1:i , function(x){

                                if(nrow(df[df[,"Cluster"]==x,])<=1) {
                                    quantity <- 0
                                }else{

                                    quantity <- sum(colProds(as.matrix(df[df[,"Cluster"]==x,][,- c(1,2)]))) 
                                }
                             })
                    # Creating a new df named out, just for arranging the output.
                    out <- data.frame(Cluster= 1:i ,
                                        Quantity= calc,
                                            Size= cluster.results$size,
                                                Iter = paste0(rand,"-",i)   )
                    # Merging our first data frame (df) with out (without unimportant columns)
                    my_list[[i]] <- merge(df[, c("Cluster","Index")],out,by="Cluster")

                }
            # Binding all outputs in my_list. It includes all clusters from 1:nrow(t_M)
            do.call(rbind,my_list)

            })
    # Since we have done the same steps as much as the Iter input, This is for binding all.
    # So it includes, iter * clusteroutput data
    result <- do.call(rbind,result)
    # Neglecting unnecessary columns 
    result <- result[,-(ncol(result)-2)]

    colnames(result) <- c("Cluster","Index","Matching","Years","Iter")

return(result)
}

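Clustering depends heavily on the seed, so the clustering output may change when the seed changes. For this reason, I computed the possible clusters for different seeds. The idea is to detect similar groups and do the calculation on them without going through all combinations.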

#100 is the number of the seeds, when it increases the calculation time also increases. 
#For calculating M3 matrix, it may be good to decrease the seed amount.
#However, it may decrease the quality of the clustering. (Trade off)

my_result <- opt(M1,100) # It takes about one minute in my computer.

tail(my_result)

      Cluster Index Matching Years   Iter
22495      10     1        0     1 100-15
22496      11    13        0     1 100-15
22497      12     2        0     1 100-15
22498      13    10        0     1 100-15
22499      14     5        0     1 100-15
22500      15     8        0     1 100-15

# Years : Number of the Years in the solution
# Matching : Number of the matches in a cluster
# Cluster : The Cluster Number
# Iter : Iteration of Cluster
# Index : The index of where the cluster is located at the input dataframe
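To make the Matching column concrete, here is a base-R sketch of the column-product idea used inside opt. It swaps matrixStats::colProds for apply(..., 2, prod) and uses a made-up availability matrix, so the data and the names `avail` and `cluster_years` are assumptions for illustration.

```r
# Rows = years, columns = countries; TRUE means the year is observed
avail <- rbind(y2001 = c(TRUE, TRUE,  FALSE, TRUE),
               y2004 = c(TRUE, FALSE, TRUE,  TRUE),
               y2014 = c(TRUE, TRUE,  TRUE,  TRUE))

# Suppose a cluster groups these two years together
cluster_years <- c("y2001", "y2014")

# The product down a column is 1 only if the country has every year in the cluster
complete_two <- apply(avail[cluster_years, , drop = FALSE], 2, prod)
sum(complete_two)  # 3 countries share both years

# The same check for all three years
complete_all <- apply(avail, 2, prod)
sum(complete_all)  # 2 countries share all three years
```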

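From here on, it is a data manipulation job, and any filtering can be applied. For example, suppose I only want to see the 3-year outputs with the maximum number of matches: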

library(dplyr)  # provides %>%, filter(), group_by(), select(), group_split()

out_list1 <- my_result %>% filter(Years==3) %>% 
            filter(Matching==max(Matching)) %>% 
            group_by(Iter,Cluster,add = TRUE) %>% 
            select(Index) %>% group_split() 

Once out_list1 is available, the function below produces the desired output.

Function:

find_match <- function(output,data) {

x <- unique(lapply(1:length(output),function(k) {
        sort(t(output[[k]][3]))
        }))
return(lapply(1:length(x),function(i) na.omit(unique(data[,x[[i]]]))[1,]))

}

If we call find_match,

find_match(out_list1,M1)

[[1]]
[1] 2001 2004 2014

[[2]]
[1] 2001 2010 2014

[[3]]
[1] 2001 2012 2014

Another trial: replicating the output of your findBestIntersect(M1)

out_list2 <- my_result %>% filter(Years>=3) %>%
            filter(Matching>=5) %>% 
            group_by(Iter,Cluster,add = TRUE) %>% 
            select(Index) %>% group_split() 

find_match(out_list2,M1)

[[1]]
[1] 2001 2010 2014

[[2]]
[1] 2001 2004 2010 2014

[[3]]
[1] 2001 2004 2014

[[4]]
[1] 2001 2004 2012 2014

[[5]]
[1] 2001 2007 2014

[[6]]
[1] 2001 2012 2014

[[7]]
[1] 2007 2012 2014

[[8]]
[1] 2001 2004 2010

[[9]]
[1] 2001 2011 2014

[[10]]
[1] 2001 2010 2012 2014

The difference is that your output also gives

2001 2005 2014

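Since kmodes is a classifier, we cannot expect a perfect fit; this kind of misclassification is expected from any predictive algorithm. Interestingly, though, your function also misses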

2001 2004 2010

which is also a solution.

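I wrote a coded_best_intersect function that relies on dynamically creating nested for loops in a code_maker function. It evaluates M3 in about 30 seconds. Because the code produces a list, I rely on data.table for rbindlist and its print method.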

library(data.table)

The code_maker function:

code_maker <- function(non_NA_M, n, k, min.col) {
  ## initializing for results
  res <- list()
  z <- 1
  ## initializing naming
  col_names <- colnames(non_NA_M)
  i_s <- paste0('i', seq_len(k))
  ## create the for loop text. It looks like this mostly
  ## for (i1 in 1:(n - k + 1)) { for (i2 in (i1 + 1):(n-k+2)) {}}
  for_loop <- paste0('for (', i_s, ' in ', c('1:', paste0('(', i_s[-k], ' + 1):')), 
                     n - k + seq_len(k), ')', ' {\n non_na_sums', seq_len(k), 
                     '=non_NA_M[', i_s, ', ] ',
                     c('', paste0('& ', rep('non_na_sums', k - 1), seq_len(k)[-k])), '', 
                     '\n if (sum(non_na_sums', seq_len(k), ') < ', min.col, ') {next} ', 
                     collapse='\n')
  ## create the assignment back to the results which looks like
  ## res[[z]] <- data.table(M=k, N=sum(non_na_sumsk), ROWS=list(c(i1, i2, ..., ik)), 
  ##                        YEARS=list(col_names[non_na_sumsk]))
  inner_text <- paste0('\nres[[z]] <- data.table(M=k, N=sum(non_na_sums',
                       k, '), ROWS=list(c( ', paste0(i_s, collapse=', '), 
                       ')), YEARS=list(col_names[non_na_sums', k , ']))\nz <- z + 1')
  ## combines the loop parts and closes the for with }}}
  for_loop <- paste(for_loop, 
                    inner_text, 
                    paste0(rep('}', k), collapse=''))
  ## evaluate - the evaluation will assign back to res[[i]]  
  eval(parse(text=for_loop))
  res <- rbindlist(res)
  if (length(res) == 0) { # to return an empty data.table with the correct fields
    return(data.table(M=integer(), N=integer(), ROWS=list(), YEARS=list()))
  }
  res$M <- k
  return(res)
}
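The string-building step in code_maker can be seen in isolation with a much smaller sketch. This is not the author's generator, only the same generate, parse, evaluate pattern reduced to k nested loops that record every index pair; the variable names are made up for the example.

```r
k <- 2
idx <- paste0("i", seq_len(k))  # "i1" "i2"

# One "for (...) {" header per nesting level, joined into a single string
loops <- paste0("for (", idx, " in 1:2) {", collapse = " ")

# Innermost statement: append the current index combination to `out`
body <- paste0("out[[length(out) + 1]] <- c(", paste(idx, collapse = ", "), ")")

# Close as many braces as loops were opened
code <- paste(loops, body, strrep("}", k))

out <- list()
eval(parse(text = code))  # runs the generated loops in the current environment
length(out)               # 4 combinations: (1,1), (1,2), (2,1), (2,2)
```

Generating the loops as text is what allows the nesting depth to depend on k at run time.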

The coded_best_intersect function:

coded_best_intersect <- function(M, min.row=5, min.col=3) {
  colnames(M) <- apply(M, 2, function(x) na.omit(x)[1])
  n_row <- nrow(M)
  non_NA <- !is.na(M)
  n_combos <- min.row:(n_row - 1)
  res2 <- list()
  for (i in seq_along(n_combos)) {
    res2[[i]] <- code_maker(non_NA, n=n_row, k=n_combos[i], min.col)
    if (nrow(res2[[i]]) == 0) {
      break
    }
  }
  return(res2)
}

For example, with n = 10 rows, k = 6 and min.col = 3:

Dynamically generated code
# for (i1 in 1:5) {
#   non_na_sums1=non_NA_M[i1, ] 
#   if (sum(non_na_sums1) < 3) {next} 
#   for (i2 in (i1 + 1):6) {
#     non_na_sums2=non_NA_M[i2, ] & non_na_sums1
#     if (sum(non_na_sums2) < 3) {next} 
#     for (i3 in (i2 + 1):7) {
#       non_na_sums3=non_NA_M[i3, ] & non_na_sums2
#       if (sum(non_na_sums3) < 3) {next} 
#       for (i4 in (i3 + 1):8) {
#         non_na_sums4=non_NA_M[i4, ] & non_na_sums3
#         if (sum(non_na_sums4) < 3) {next} 
#         for (i5 in (i4 + 1):9) {
#           non_na_sums5=non_NA_M[i5, ] & non_na_sums4
#           if (sum(non_na_sums5) < 3) {next} 
#           for (i6 in (i5 + 1):10) {
#             non_na_sums6=non_NA_M[i6, ] & non_na_sums5
#             if (sum(non_na_sums6) < 3) {next}  
#             res[[z]] <- data.table(M=k, N=sum(non_na_sums6), 
#                                    ROWS=list(c( i1, i2, i3, i4, i5, i6)),
#                                    YEARS=list(col_names[non_na_sums6]))
#             z <- z + 1 }}}}}}

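You may notice the {next}, which skips a combination as soon as it can no longer yield at least 3 columns. Although it all looks hard-coded, the code is actually a string that is generated, parsed, and then evaluated.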
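The pruning idea does not depend on code generation. As a sketch under that assumption, the same search can be written as a recursive function in base R: carry the running AND of the availability rows and abandon a branch once fewer than min_col columns remain. The function `search_rows` and the toy matrix are hypothetical, not part of the answer's code.

```r
# Depth-first search over row combinations of size k with early pruning
search_rows <- function(avail, k, min_col) {
  n <- nrow(avail)
  found <- list()
  recurse <- function(rows, common) {
    if (length(rows) == k) {
      found[[length(found) + 1L]] <<- list(rows = rows, cols = which(common))
      return(invisible(NULL))
    }
    start <- if (length(rows)) rows[length(rows)] + 1L else 1L
    last <- n - (k - length(rows)) + 1L  # leave enough rows to reach size k
    if (start > last) return(invisible(NULL))
    for (i in start:last) {
      shared <- common & avail[i, ]
      if (sum(shared) < min_col) next    # prune: this branch cannot recover
      recurse(c(rows, i), shared)
    }
    invisible(NULL)
  }
  recurse(integer(0), rep(TRUE, ncol(avail)))
  found
}

avail <- rbind(c(TRUE,  TRUE,  FALSE, TRUE),
               c(TRUE,  FALSE, TRUE,  TRUE),
               c(TRUE,  TRUE,  TRUE,  TRUE),
               c(FALSE, TRUE,  TRUE,  FALSE))
res <- search_rows(avail, k = 2, min_col = 3)
length(res)  # 2: rows (1, 3) share columns 1, 2, 4; rows (2, 3) share columns 1, 3, 4
```

Pruning works because adding a row can only remove shared columns, never add them.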

Usage and performance

Matrix M1:

system.time(final1 <- coded_best_intersect(M1))
   user  system elapsed 
      0       0       0 
data.table::rbindlist(final1)[order(-M*N)]
   M N           ROWS          YEARS
1: 5 3  2, 4, 8, 9,10 2002,2004,2010

Matrix M2:

system.time(final2 <- coded_best_intersect(M2))
   user  system elapsed 
   0.08    0.00    0.08 
data.table::rbindlist(final2)[order(-M*N)]
     M N                  ROWS               YEARS
  1: 7 3  6, 8,11,12,13,16,...      2002,2012,2013
  2: 5 4         6, 8,13,16,17 2002,2012,2013,2015
  3: 5 4         8,11,12,13,17 2002,2012,2013,2014
  4: 6 3      1, 4, 8,13,17,20      2002,2014,2015
  5: 6 3      2, 5, 6,10,14,17      2003,2006,2008
 ---                                              
126: 5 3        10,12,13,17,20      2002,2008,2014
127: 5 3        10,12,14,17,20      2003,2008,2014
128: 5 3        11,12,13,16,17      2002,2012,2013
129: 5 3        11,12,13,17,20      2002,2012,2014
130: 5 3        12,13,15,16,19      2001,2002,2013

Matrix M3:

system.time(final3 <- coded_best_intersect(M3))
   user  system elapsed 
  29.37    0.05   29.54 
data.table::rbindlist(final3)[order(-M*N)]
       M N              ROWS                             YEARS
    1: 6 7  1, 3, 8,15,20,29 1969,1973,1980,1984,1985,1992,...
    2: 5 8     1, 3, 8,14,29 1969,1973,1976,1980,1984,1987,...
    3: 5 8     1, 3, 8,20,29 1969,1973,1980,1984,1985,1992,...
    4: 5 8     2, 7, 9,13,17 1974,1993,1994,2004,2012,2013,...
    5: 5 8     3, 6, 8, 9,27 1974,1980,1984,1987,1995,1998,...
   ---                                                        
52374: 5 3    23,24,25,30,31                    1979,1997,2002
52375: 5 3    23,25,28,30,31                    1979,1992,2002
52376: 5 3    24,25,26,30,31                    1983,1997,2002
52377: 5 3    24,25,28,30,31                    1979,1983,2002
52378: 5 3    24,26,28,30,31                    1983,1986,2002

To extract a selected part of the result as a character vector, you can do, for example, the following:

x <- data.table::rbindlist(final3)[order(-M*N)]
el(x$YEARS[1])  # select `YEARS` of result-row `1:`
# [1] "1969" "1973" "1980" "1984" "1985" "1992" "2003"

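Note: see the edit history for two other, quite different approaches. The first is a melt and join technique, which blows up memory. The second uses RcppAlgos::comboGeneral to evaluate a function.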

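This is a small problem for mixed integer programming and can be solved quite quickly even with a weak open-source solver such as glpk. I am using the ompr package for the mathematical modeling and have included the model logic as comments in the code. Note that my random data differs from the OP's, I guess because of different R versions.

With the model set to maximize over at most 15 years of data, the total run time for M3 was about one minute (the actual solve time is even less). This method can easily be scaled to larger instances.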

library(dplyr)
library(ROI)
library(ROI.plugin.glpk)
library(ompr)
library(ompr.roi)

set.seed(42)
tf <- matrix(sample(c(TRUE, FALSE), 1488, replace=TRUE), 31)
M3 <- t(replicate(31, 1969:2016, simplify=TRUE))
M3[tf] <- NA

m <- +!is.na(M3) # gets logical matrix; 0 if NA else 1    
nr <- nrow(m)
nc <- ncol(m)
n_years <- 15 

model <- MIPModel() %>% 
  # keep[i,j] is 1 if matrix cell [i,j] is to be kept else 0
  add_variable(keep[i,j], i = 1:nr, j = 1:nc, type = "binary") %>% 
  # rm_row[i] is 1 if row i is selected for removal else 0
  add_variable(rm_row[i], i = 1:nr, type = "binary") %>% 
  # rm_col[j] is 1 if column j is selected for removal else 0
  add_variable(rm_col[j], j = 1:nc, type = "binary") %>% 
  # maximize good cells kept
  set_objective(sum_expr(keep[i,j], i = 1:nr, j = 1:nc), "max") %>% 
  # cell can be kept only when row is not selected for removal
  add_constraint(sum_expr(keep[i,j], j = 1:nc) <= 1 - rm_row[i], i = 1:nr) %>%
  # cell can be kept only when column is not selected for removal
  add_constraint(sum_expr(keep[i,j], i = 1:nr) <= 1 - rm_col[j], j = 1:nc) %>%
  # only non-NA values can be kept
  add_constraint(m[i,j] + rm_row[i] + rm_col[j] >= 1, i = 1:nr, j = 1:nc) %>% 
  # keep at most n_years columns i.e. remove at least (nc - n_years) columns
  # I used >= instead of == to avoid infeasiblity
  add_constraint(sum_expr(rm_col[j], j = 1:nc) >= nc - n_years) %>% 
  # solve using free glpk solver
  solve_model(with_ROI(solver = "glpk"))
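The non-NA constraint in the model, m[i,j] + rm_row[i] + rm_col[j] >= 1, can be checked by hand on a tiny made-up example without any solver. The matrix and the removal vectors below are assumptions chosen for illustration, not solver output.

```r
# 1 = observed, 0 = NA (same encoding as m in the model)
m <- rbind(c(1, 0, 1),
           c(1, 1, 1),
           c(0, 1, 1))

# Candidate solution: remove row 3 and column 2
rm_row <- c(0, 0, 1)
rm_col <- c(0, 1, 0)

# Every cell must be observed, or sit in a removed row, or sit in a removed column
feasible <- all(m + outer(rm_row, rm_col, "+") >= 1)

# Removing nothing violates the constraint because m contains zeros
feasible_without_removal <- all(m + outer(c(0, 0, 0), c(0, 0, 0), "+") >= 1)

# The block that remains contains only observed cells
kept <- m[rm_row == 0, rm_col == 0, drop = FALSE]
```

Here `feasible` is TRUE and `feasible_without_removal` is FALSE, which is the behavior that constraint is meant to enforce.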

Results:

solver_status(model)
# [1] "optimal"    <- indicates guaranteed optimum (at least one of the many possible)

# get rows to remove
rm_rows <- model %>% 
  get_solution(rm_row[i]) %>% 
  filter(value > 0) %>% pull(i) %>% print()

# [1]  1  2  3  4  6  8  9 11 12 13 14 15 17 18 19 20 21 22 23 25 27 28 29 30 31

# get columns to remove
rm_cols <- model %>% 
  get_solution(rm_col[j]) %>% 
  filter(value > 0) %>% pull(j) %>% print()

# [1]  2  3  4  5  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
# [24] 27 28 29 30 31 32 33 34 35 36 38 39 40 41 44 45 46 47 48

result <- M3[-rm_rows, -rm_cols, drop = FALSE]

     [,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1969 1974 1994 2005 2010 2011
[2,] 1969 1974 1994 2005 2010 2011
[3,] 1969 1974 1994 2005 2010 2011
[4,] 1969 1974 1994 2005 2010 2011
[5,] 1969 1974 1994 2005 2010 2011
[6,] 1969 1974 1994 2005 2010 2011