如何优化矩阵中行和列的交集?
How to optimize intersect of rows and columns in a matrix?
在矩阵中,例如M1
,行是国家,列是年份。这些国家没有同一年的观测资料。我想找到给我最多国家的“最佳”年份交叉点。最低年限和最低国家/地区的数量将预先确定。结果中包括哪些国家并不重要,年份不必是连续的。
> M1
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15]
[1,] NA NA NA 2004 NA 2006 NA 2008 2009 NA 2011 2012 NA NA NA
[2,] NA 2002 NA 2004 NA NA 2007 NA NA 2010 2011 NA 2013 2014 NA
[3,] NA NA NA 2004 2005 2006 2007 2008 2009 NA NA 2012 2013 NA 2015
[4,] NA 2002 NA 2004 2005 2006 2007 2008 NA 2010 2011 NA 2013 NA NA
[5,] 2001 NA NA NA 2005 2006 2007 2008 NA 2010 NA 2012 2013 2014 NA
[6,] 2001 NA 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 NA 2014 NA
[7,] 2001 2002 NA NA 2005 NA 2007 NA 2009 NA 2011 NA NA 2014 2015
[8,] 2001 2002 NA 2004 2005 2006 NA NA NA 2010 NA NA 2013 NA 2015
[9,] NA 2002 NA 2004 2005 NA 2007 NA NA 2010 2011 NA NA NA NA
[10,] 2001 2002 NA 2004 NA NA NA NA NA 2010 NA 2012 NA 2014 2015
因为没有明显的交叉点,所以一次 Reduce(intersect...)
尝试是行不通的,我通过连续排除一个国家达到定义的阈值 n.row
来重复这样做。结果至少过滤了 n.col
年。我写了这个函数,
findBestIntersect <- function(M, min.row=5, min.col=3) {
## min.row: minimum number of rows (countries) to analyze
## min.col: minimum number of complete columns (years)
# put matrices with row combn into list (HUGE!)
L1 <- lapply(min.row:(nrow(M) - 1), function(x)
combn(nrow(M), x, function(i) M[i, ], simplify=FALSE))
# select lists w/ def. number of complete columns
slc <- sapply(L1, function(y) # numbers of lists
which(sapply(y, function(x)
sum(!(apply(x, 2, function(i) any(is.na(i))))))
>= min.col))
# list selected lists
L2 <- Map(function(x, i)
x[i], L1[lengths(slc) > 0], slc[lengths(slc) > 0])
# find intersects
L3 <- rapply(L2, function(l)
as.integer(na.omit(Reduce(intersect, as.list(as.data.frame(t(l)))))),
how="list")
return(unique(unlist(L3, recursive=FALSE)))
}
这让我很快就得到了 M1
想要的结果。
> system.time(best.yrs.1 <- findBestIntersect(M1))
user system elapsed
0.06 0.00 0.07
> best.yrs.1
[[1]]
[1] 2002 2004 2010
但是 M2
的性能只是勉强可以接受(RAM 使用量约为 1.1 GB),
> system.time(best.yrs.2 <- findBestIntersect(M2))
user system elapsed
79.90 0.39 82.76
> head(best.yrs.2, 3)
[[1]]
[1] 2002 2009 2015
[[2]]
[1] 2002 2014 2015
[[3]]
[1] 2003 2009 2010
并且您不想使用类似于我的真实矩阵的 M3
(爆炸 32 GB RAM)来尝试此操作:
# best.yrs.3 <- findBestIntersect(M3)
函数最大的缺陷可能是 L1
变得太大非常快。
所以,我的问题是,是否有更好的方法也适用于M3
? “奖金”将最大化国家和年份。如果可能的话,我想在没有额外包的情况下做到这一点。
数据
set.seed(42)
tf <- matrix(sample(c(TRUE, FALSE), 150, replace=TRUE), 10)
M1 <- t(replicate(10, 2001:2015, simplify=TRUE))
M1[tf] <- NA
tf <- matrix(sample(c(TRUE, FALSE), 300, replace=TRUE), 20)
M2 <- t(replicate(20, 2001:2015, simplify=TRUE))
M2[tf] <- NA
tf <- matrix(sample(c(TRUE, FALSE), 1488, replace=TRUE), 31)
M3 <- t(replicate(31, 1969:2016, simplify=TRUE))
M3[tf] <- NA
由于组合使用了大量内存而没有给出任何结果(至少在我的电脑上它给出了错误),也许聚类数据可以提供解决方案。
数据:set.seed(42)
> M1
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15]
[1,] 2001 NA 2003 2004 NA NA 2007 NA 2009 2010 2011 2012 NA 2014 NA
[2,] 2001 2002 NA 2004 NA NA 2007 NA NA NA NA 2012 NA 2014 NA
[3,] NA 2002 2003 NA NA NA 2007 NA NA NA NA 2012 2013 2014 NA
[4,] 2001 NA 2003 2004 2005 2006 2007 NA 2009 2010 NA NA 2013 2014 2015
[5,] 2001 NA NA NA NA NA 2007 NA 2009 2010 2011 2012 2013 2014 2015
[6,] 2001 2002 2003 2004 2005 2006 NA 2008 2009 2010 2011 2012 NA 2014 2015
[7,] 2001 2002 NA NA 2005 2006 NA NA NA NA 2011 NA 2013 2014 NA
[8,] NA NA 2003 NA 2005 NA 2007 NA NA 2010 2011 NA 2013 NA 2015
[9,] 2001 NA NA 2004 2005 NA 2007 2008 NA 2010 2011 2012 NA 2014 NA
[10,] 2001 2002 2003 2004 2005 2006 NA NA NA 2010 NA 2012 2013 2014 NA
我使用 klaR
库来使用 kmodes
函数,该函数用于对分类变量进行聚类(类似于 kmeans
)
函数:
library(klaR)
library(matrixStats)
opt <- function(data,iter) {
my_list <- list() # It will be assigned by the cluster outputs
t_M <- t(!is.na(data)) # Transforming the data into TRUE/FALSE in other words 1/0
result <- lapply(1:iter, function(rand) { # This is for assigning new seeds.
set.seed(rand^2)
for(i in 2:nrow(t_M)-1) { # Loop for the Number of the clusters
# cluster function
cluster.results <-kmodes(t_M, i, iter.max = 100, weighted = FALSE , fast = TRUE)
# Binding the data and clusters output. And adding index for finding the location
df <- as.data.frame(cbind(Index=1:nrow(t_M),Cluster=cluster.results$cluster,t_M))
# Below is the calculation step. It multiplies the columns. The idea is here if
# there exist an intersection, the product must be 1. Otherwise it is 0.
calc <- sapply(1:i , function(x){
if(nrow(df[df[,"Cluster"]==x,])<=1) {
quantity <- 0
}else{
quantity <- sum(colProds(as.matrix(df[df[,"Cluster"]==x,][,- c(1,2)])))
}
})
# Creating a new df named out, just for arranging the output.
out <- data.frame(Cluster= 1:i ,
Quantity= calc,
Size= cluster.results$size,
Iter = paste0(rand,"-",i) )
# Merging our first dataframw (df) with out (without unimportant columns)
my_list[[i]] <- merge(df[, c("Cluster","Index")],out,by="Cluster")
}
# Binding all outputs in my_list. It includes all clusters from 1:nrow(t_M)
do.call(rbind,my_list)
})
# Since we have done the same steps as much as the Iter input, This is for binding all.
# So it includes, iter * clusteroutput data
result <- do.call(rbind,result)
# Neglecting unnecessary columns
result <- result[,-(ncol(result)-2)]
colnames(result) <- c("Cluster","Index","Matching","Years","Iter")
return(result)
}
聚类过于依赖种子。因此,当播种发生变化时,聚类输出可能会发生变化。出于这个原因,我计算了关于不同种子的可能集群。这里的想法是检测相似的组并在不深入所有组合的情况下对它们进行计算。
#100 is the number of the seeds, when it increases the calculation time also increases.
#For calculating M3 matrix, it may be good to decrease the seed amount.
#However, it may decrease the quality of the clustering. (Trade off)
my_result <- opt(M1,100) # It takes about one minute in my computer.
tail(my_result)
Cluster Index Matching Years Iter
22495 10 1 0 1 100-15
22496 11 13 0 1 100-15
22497 12 2 0 1 100-15
22498 13 10 0 1 100-15
22499 14 5 0 1 100-15
22500 15 8 0 1 100-15
# Years : Number of the Years in the solution
# Matching : Number of the matches in a cluster
# Cluster : The Cluster Number
# Iter : Iteration of Cluster
# Index : The index of where the cluster is located at the input dataframe
从此以后,就变成了数据操作的工作。可以进行任何过滤。例如,假设我只想查看具有最大匹配的 3 年输出。所以,
out_list1 <- my_result %>% filter(Years==3) %>%
filter(Matching==max(Matching)) %>%
group_by(Iter,Cluster,add = TRUE) %>%
select(Index) %>% group_split()
得到out_list
后,使用下面的函数可以达到想要的输出。
函数:
find_match <- function(output,data) {
x <- unique(lapply(1:length(output),function(k) {
sort(t(output[[k]][3]))
}))
return(lapply(1:length(x),function(i) na.omit(unique(data[,x[[i]]]))[1,]))
}
如果我们调用 find_match
,
find_match(out_list1,M1)
[[1]]
[1] 2001 2004 2014
[[2]]
[1] 2001 2010 2014
[[3]]
[1] 2001 2012 2014
另一个试验: 复制您的输入findBestIntersect(M1)
out_list2 <- my_result %>% filter(Years>=3) %>%
filter(Matching>=5) %>%
group_by(Iter,Cluster,add = TRUE) %>%
select(Index) %>% group_split()
find_match(out_list2,M1)
[[1]]
[1] 2001 2010 2014
[[2]]
[1] 2001 2004 2010 2014
[[3]]
[1] 2001 2004 2014
[[4]]
[1] 2001 2004 2012 2014
[[5]]
[1] 2001 2007 2014
[[6]]
[1] 2001 2012 2014
[[7]]
[1] 2007 2012 2014
[[8]]
[1] 2001 2004 2010
[[9]]
[1] 2001 2011 2014
[[10]]
[1] 2001 2010 2012 2014
不同的是,你的输出还给出了
2001 2005 2014
由于 kmodes
是一个分类器,我们不能指望完美拟合。所以这是所有预测算法都预期的错误分类错误。但有趣的是,你的功能也错过了,
2001 2004 2010
这也是一种解决方法。
我编写了一个 coded_best_intersect
函数,该函数依赖于在 code_maker
函数中动态创建 for 循环。它在 30 秒内评估 M3
。因为代码生成了一个列表,所以我依赖于 data.table
for rbindlist
和 print 方法。
library(data.table)
code_maker
函数:
code_maker <- function(non_NA_M, n, k, min.col) {
## initializing for results
res <- list()
z <- 1
## initializing naming
col_names <- colnames(non_NA_M)
i_s <- paste0('i', seq_len(k))
## create the foor loop text. It looks like this mostly
## for (i1 in 1:(n - k + 1)) { for (i2 in (i1 + 1):(n-k+2)) {}}
for_loop <- paste0('for (', i_s, ' in ', c('1:', paste0('(', i_s[-k], ' + 1):')),
n - k + seq_len(k), ')', ' {\n non_na_sums', seq_len(k),
'=non_NA_M[', i_s, ', ] ',
c('', paste0('& ', rep('non_na_sums', k - 1), seq_len(k)[-k])), '',
'\n if (sum(non_na_sums', seq_len(k), ') < ', min.col, ') {next} ',
collapse='\n')
## create the assignment back to the results which looks like
## res[[z]] <- data.table(M=k, N=sum(non_na_sumsk), ROWS=list(c(i1, i2, ..., ik)),
## YEARS=list(col_names[non_na_sumsk]))
inner_text <- paste0('\nres[[z]] <- data.table(M=k, N=sum(non_na_sums',
k, '), ROWS=list(c( ', paste0(i_s, collapse=', '),
')), YEARS=list(col_names[non_na_sums', k , ']))\nz <- z + 1')
## combines the loop parts and closes the for with }}}
for_loop <- paste(for_loop,
inner_text,
paste0(rep('}', k), collapse=''))
## evaluate - the evaluation will assign back to res[[i]]
eval(parse(text=for_loop))
res <- rbindlist(res)
if (length(res) == 0) { #to return emtpy data.table with the correct fields
return(data.table(M=integer(), N=integer(), ROWS=list(), YEARS=list()))
}
res$M <- k
return(res)
}
coded_best_intersect
函数:
coded_best_intersect <- function(M, min.row=5, min.col=3) {
colnames(M) <- apply(M, 2, function(x) na.omit(x)[1])
n_row <- nrow(M)
non_NA <- !is.na(M)
n_combos <- min.row:(n_row - 1)
res2 <- list()
for (i in seq_along(n_combos)) {
res2[[i]] <- code_maker(non_NA, n=n_row, k=n_combos[i], min.col)
if (nrow(res2[[i]]) == 0) {
break
}
}
return(res2)
}
例如k=5
:
动态生成的代码
# for (i1 in 1:5) {
# non_na_sums1=non_NA_M[i1, ]
# if (sum(non_na_sums1) < 3) {next}
# for (i2 in (i1 + 1):6) {
# non_na_sums2=non_NA_M[i2, ] & non_na_sums1
# if (sum(non_na_sums2) < 3) {next}
# for (i3 in (i2 + 1):7) {
# non_na_sums3=non_NA_M[i3, ] & non_na_sums2
# if (sum(non_na_sums3) < 3) {next}
# for (i4 in (i3 + 1):8) {
# non_na_sums4=non_NA_M[i4, ] & non_na_sums3
# if (sum(non_na_sums4) < 3) {next}
# for (i5 in (i4 + 1):9) {
# non_na_sums5=non_NA_M[i5, ] & non_na_sums4
# if (sum(non_na_sums5) < 3) {next}
# for (i6 in (i5 + 1):10) {
# non_na_sums6=non_NA_M[i6, ] & non_na_sums5
# if (sum(non_na_sums6) < 3) {next}
# res[[z]] <- data.table(M=k, N=sum(non_na_sums6),
# ROWS=list(c( i1, i2, i3, i4, i5, i6)),
# YEARS=list(col_names[non_na_sums6]))
# z <- z + 1 }}}}}}
您可能会注意到 {next}
,这是一种在无法获得至少 3 列的情况下跳过组合的方法。虽然看起来都是硬编码的,但代码实际上是生成、解析然后求值的字符串。
用法和性能
矩阵M1
:
system.time(final1 <- coded_best_intersect(M1))
user system elapsed
0 0 0
data.table::rbindlist(final1)[order(-M*N)]
M N ROWS YEARS
1: 5 3 2, 4, 8, 9,10 2002,2004,2010
矩阵M2
:
system.time(final2 <- coded_best_intersect(M2))
user system elapsed
0.08 0.00 0.08
data.table::rbindlist(final2)[order(-M*N)]
M N ROWS YEARS
1: 7 3 6, 8,11,12,13,16,... 2002,2012,2013
2: 5 4 6, 8,13,16,17 2002,2012,2013,2015
3: 5 4 8,11,12,13,17 2002,2012,2013,2014
4: 6 3 1, 4, 8,13,17,20 2002,2014,2015
5: 6 3 2, 5, 6,10,14,17 2003,2006,2008
---
126: 5 3 10,12,13,17,20 2002,2008,2014
127: 5 3 10,12,14,17,20 2003,2008,2014
128: 5 3 11,12,13,16,17 2002,2012,2013
129: 5 3 11,12,13,17,20 2002,2012,2014
130: 5 3 12,13,15,16,19 2001,2002,2013
矩阵M3
:
system.time(final3 <- coded_best_intersect(M3))
user system elapsed
29.37 0.05 29.54
data.table::rbindlist(final3)[order(-M*N)]
M N ROWS YEARS
1: 6 7 1, 3, 8,15,20,29 1969,1973,1980,1984,1985,1992,...
2: 5 8 1, 3, 8,14,29 1969,1973,1976,1980,1984,1987,...
3: 5 8 1, 3, 8,20,29 1969,1973,1980,1984,1985,1992,...
4: 5 8 2, 7, 9,13,17 1974,1993,1994,2004,2012,2013,...
5: 5 8 3, 6, 8, 9,27 1974,1980,1984,1987,1995,1998,...
---
52374: 5 3 23,24,25,30,31 1979,1997,2002
52375: 5 3 23,25,28,30,31 1979,1992,2002
52376: 5 3 24,25,26,30,31 1983,1997,2002
52377: 5 3 24,25,28,30,31 1979,1983,2002
52378: 5 3 24,26,28,30,31 1983,1986,2002
要将结果的选定部分放入字符串中,您可以执行例如以下:
x <- data.table::rbindlist(final3)[order(-M*N)]
el(x$YEARS[1]) # select `YEARS` of result-row `1:`
# [1] "1969" "1973" "1980" "1984" "1985" "1992" "2003"
注意:查看其他两种截然不同的方法的编辑历史。第一个是 melt
和 join 技术,它会破坏内存。第二种方法是使用 RcppAlgos::comboGeneral
来计算函数。
这是一个使用混合整数规划的小问题,即使使用像 glpk
这样的弱开源求解器也可以很快解决。我正在使用 ompr
包进行数学建模(有关 ompr 的更多信息),并将模型逻辑作为注释包含在代码中。请注意,由于我猜 R 版本不同,我的随机数据与 OP 的不同。
当模型设置为最多 15 年的数据最大化时,M3
的总 运行 时间约为一分钟(即实际求解时间甚至更少)。此方法可以轻松扩展到更大的实例。
library(dplyr)
library(ROI)
library(ROI.plugin.glpk)
library(ompr)
library(ompr.roi)
set.seed(42)
tf <- matrix(sample(c(TRUE, FALSE), 1488, replace=TRUE), 31)
M3 <- t(replicate(31, 1969:2016, simplify=TRUE))
M3[tf] <- NA
m <- +!is.na(M3) # gets logical matrix; 0 if NA else 1
nr <- nrow(m)
nc <- ncol(m)
n_years <- 15
model <- MIPModel() %>%
# keep[i,j] is 1 if matrix cell [i,j] is to be kept else 0
add_variable(keep[i,j], i = 1:nr, j = 1:nc, typ = "binary") %>%
# rm_row[i] is 1 if row i is selected for removal else 0
add_variable(rm_row[i], i = 1:nr, type = "binary") %>%
# rm_col[j] is 1 if column j is selected for removal else 0
add_variable(rm_col[j], j = 1:nc, type = "binary") %>%
# maximize good cells kept
set_objective(sum_expr(keep[i,j], i = 1:nr, j = 1:nc), "max") %>%
# cell can be kept only when row is not selected for removal
add_constraint(sum_expr(keep[i,j], j = 1:nc) <= 1 - rm_row[i], i = 1:nr) %>%
# cell can be kept only when column is not selected for removal
add_constraint(sum_expr(keep[i,j], i = 1:nr) <= 1 - rm_col[j], j = 1:nc) %>%
# only non-NA values can be kept
add_constraint(m[i,j] + rm_row[i] + rm_col[j] >= 1, i = 1:nr, j = 1:nc) %>%
# keep at most n_years columns i.e. remove at least (nc - n_years) columns
# I used >= instead of == to avoid infeasiblity
add_constraint(sum_expr(rm_col[j], j = 1:nc) >= nc - n_years) %>%
# solve using free glpk solver
solve_model(with_ROI(solver = "glpk"))
结果-
solver_status(model)
# [1] "optimal" <- indicates guaranteed optimum (at least one of the many possible)
# get rows to remove
rm_rows <- model %>%
get_solution(rm_row[i]) %>%
filter(value > 0) %>% pull(i) %>% print()
# [1] 1 2 3 4 6 8 9 11 12 13 14 15 17 18 19 20 21 22 23 25 27 28 29 30 31
# get columns to remove
rm_cols <- model %>%
get_solution(rm_col[j]) %>%
filter(value > 0) %>% pull(j) %>% print()
# [1] 2 3 4 5 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
# [24] 27 28 29 30 31 32 33 34 35 36 38 39 40 41 44 45 46 47 48
result <- M3[-rm_rows, -rm_cols, drop = F]
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1969 1974 1994 2005 2010 2011
[2,] 1969 1974 1994 2005 2010 2011
[3,] 1969 1974 1994 2005 2010 2011
[4,] 1969 1974 1994 2005 2010 2011
[5,] 1969 1974 1994 2005 2010 2011
[6,] 1969 1974 1994 2005 2010 2011
在矩阵中,例如M1
,行是国家,列是年份。这些国家没有同一年的观测资料。我想找到给我最多国家的“最佳”年份交叉点。最低年限和最低国家/地区的数量将预先确定。结果中包括哪些国家并不重要,年份不必是连续的。
> M1
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15]
[1,] NA NA NA 2004 NA 2006 NA 2008 2009 NA 2011 2012 NA NA NA
[2,] NA 2002 NA 2004 NA NA 2007 NA NA 2010 2011 NA 2013 2014 NA
[3,] NA NA NA 2004 2005 2006 2007 2008 2009 NA NA 2012 2013 NA 2015
[4,] NA 2002 NA 2004 2005 2006 2007 2008 NA 2010 2011 NA 2013 NA NA
[5,] 2001 NA NA NA 2005 2006 2007 2008 NA 2010 NA 2012 2013 2014 NA
[6,] 2001 NA 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 NA 2014 NA
[7,] 2001 2002 NA NA 2005 NA 2007 NA 2009 NA 2011 NA NA 2014 2015
[8,] 2001 2002 NA 2004 2005 2006 NA NA NA 2010 NA NA 2013 NA 2015
[9,] NA 2002 NA 2004 2005 NA 2007 NA NA 2010 2011 NA NA NA NA
[10,] 2001 2002 NA 2004 NA NA NA NA NA 2010 NA 2012 NA 2014 2015
因为没有明显的交叉点,所以一次 Reduce(intersect...)
尝试是行不通的,我通过连续排除一个国家达到定义的阈值 n.row
来重复这样做。结果至少过滤了 n.col
年。我写了这个函数,
findBestIntersect <- function(M, min.row=5, min.col=3) {
## min.row: minimum number of rows (countries) to analyze
## min.col: minimum number of complete columns (years)
# put matrices with row combn into list (HUGE!)
L1 <- lapply(min.row:(nrow(M) - 1), function(x)
combn(nrow(M), x, function(i) M[i, ], simplify=FALSE))
# select lists w/ def. number of complete columns
slc <- sapply(L1, function(y) # numbers of lists
which(sapply(y, function(x)
sum(!(apply(x, 2, function(i) any(is.na(i))))))
>= min.col))
# list selected lists
L2 <- Map(function(x, i)
x[i], L1[lengths(slc) > 0], slc[lengths(slc) > 0])
# find intersects
L3 <- rapply(L2, function(l)
as.integer(na.omit(Reduce(intersect, as.list(as.data.frame(t(l)))))),
how="list")
return(unique(unlist(L3, recursive=FALSE)))
}
这让我很快就得到了 M1
想要的结果。
> system.time(best.yrs.1 <- findBestIntersect(M1))
user system elapsed
0.06 0.00 0.07
> best.yrs.1
[[1]]
[1] 2002 2004 2010
但是 M2
的性能只是勉强可以接受(RAM 使用量约为 1.1 GB),
> system.time(best.yrs.2 <- findBestIntersect(M2))
user system elapsed
79.90 0.39 82.76
> head(best.yrs.2, 3)
[[1]]
[1] 2002 2009 2015
[[2]]
[1] 2002 2014 2015
[[3]]
[1] 2003 2009 2010
并且您不想使用类似于我的真实矩阵的 M3
(爆炸 32 GB RAM)来尝试此操作:
# best.yrs.3 <- findBestIntersect(M3)
函数最大的缺陷可能是 L1
变得太大非常快。
所以,我的问题是,是否有更好的方法也适用于M3
? “奖金”将最大化国家和年份。如果可能的话,我想在没有额外包的情况下做到这一点。
数据
set.seed(42)
tf <- matrix(sample(c(TRUE, FALSE), 150, replace=TRUE), 10)
M1 <- t(replicate(10, 2001:2015, simplify=TRUE))
M1[tf] <- NA
tf <- matrix(sample(c(TRUE, FALSE), 300, replace=TRUE), 20)
M2 <- t(replicate(20, 2001:2015, simplify=TRUE))
M2[tf] <- NA
tf <- matrix(sample(c(TRUE, FALSE), 1488, replace=TRUE), 31)
M3 <- t(replicate(31, 1969:2016, simplify=TRUE))
M3[tf] <- NA
由于组合使用了大量内存而没有给出任何结果(至少在我的电脑上它给出了错误),也许聚类数据可以提供解决方案。
数据:set.seed(42)
> M1
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15]
[1,] 2001 NA 2003 2004 NA NA 2007 NA 2009 2010 2011 2012 NA 2014 NA
[2,] 2001 2002 NA 2004 NA NA 2007 NA NA NA NA 2012 NA 2014 NA
[3,] NA 2002 2003 NA NA NA 2007 NA NA NA NA 2012 2013 2014 NA
[4,] 2001 NA 2003 2004 2005 2006 2007 NA 2009 2010 NA NA 2013 2014 2015
[5,] 2001 NA NA NA NA NA 2007 NA 2009 2010 2011 2012 2013 2014 2015
[6,] 2001 2002 2003 2004 2005 2006 NA 2008 2009 2010 2011 2012 NA 2014 2015
[7,] 2001 2002 NA NA 2005 2006 NA NA NA NA 2011 NA 2013 2014 NA
[8,] NA NA 2003 NA 2005 NA 2007 NA NA 2010 2011 NA 2013 NA 2015
[9,] 2001 NA NA 2004 2005 NA 2007 2008 NA 2010 2011 2012 NA 2014 NA
[10,] 2001 2002 2003 2004 2005 2006 NA NA NA 2010 NA 2012 2013 2014 NA
我使用 klaR
库来使用 kmodes
函数,该函数用于对分类变量进行聚类(类似于 kmeans
)
函数:
library(klaR)
library(matrixStats)
opt <- function(data,iter) {
my_list <- list() # It will be assigned by the cluster outputs
t_M <- t(!is.na(data)) # Transforming the data into TRUE/FALSE in other words 1/0
result <- lapply(1:iter, function(rand) { # This is for assigning new seeds.
set.seed(rand^2)
for(i in 2:nrow(t_M)-1) { # Loop for the Number of the clusters
# cluster function
cluster.results <-kmodes(t_M, i, iter.max = 100, weighted = FALSE , fast = TRUE)
# Binding the data and clusters output. And adding index for finding the location
df <- as.data.frame(cbind(Index=1:nrow(t_M),Cluster=cluster.results$cluster,t_M))
# Below is the calculation step. It multiplies the columns. The idea is here if
# there exist an intersection, the product must be 1. Otherwise it is 0.
calc <- sapply(1:i , function(x){
if(nrow(df[df[,"Cluster"]==x,])<=1) {
quantity <- 0
}else{
quantity <- sum(colProds(as.matrix(df[df[,"Cluster"]==x,][,- c(1,2)])))
}
})
# Creating a new df named out, just for arranging the output.
out <- data.frame(Cluster= 1:i ,
Quantity= calc,
Size= cluster.results$size,
Iter = paste0(rand,"-",i) )
# Merging our first dataframw (df) with out (without unimportant columns)
my_list[[i]] <- merge(df[, c("Cluster","Index")],out,by="Cluster")
}
# Binding all outputs in my_list. It includes all clusters from 1:nrow(t_M)
do.call(rbind,my_list)
})
# Since we have done the same steps as much as the Iter input, This is for binding all.
# So it includes, iter * clusteroutput data
result <- do.call(rbind,result)
# Neglecting unnecessary columns
result <- result[,-(ncol(result)-2)]
colnames(result) <- c("Cluster","Index","Matching","Years","Iter")
return(result)
}
聚类过于依赖种子。因此,当播种发生变化时,聚类输出可能会发生变化。出于这个原因,我计算了关于不同种子的可能集群。这里的想法是检测相似的组并在不深入所有组合的情况下对它们进行计算。
#100 is the number of the seeds, when it increases the calculation time also increases.
#For calculating M3 matrix, it may be good to decrease the seed amount.
#However, it may decrease the quality of the clustering. (Trade off)
my_result <- opt(M1,100) # It takes about one minute in my computer.
tail(my_result)
Cluster Index Matching Years Iter
22495 10 1 0 1 100-15
22496 11 13 0 1 100-15
22497 12 2 0 1 100-15
22498 13 10 0 1 100-15
22499 14 5 0 1 100-15
22500 15 8 0 1 100-15
# Years : Number of the Years in the solution
# Matching : Number of the matches in a cluster
# Cluster : The Cluster Number
# Iter : Iteration of Cluster
# Index : The index of where the cluster is located at the input dataframe
从此以后,就变成了数据操作的工作。可以进行任何过滤。例如,假设我只想查看具有最大匹配的 3 年输出。所以,
out_list1 <- my_result %>% filter(Years==3) %>%
filter(Matching==max(Matching)) %>%
group_by(Iter,Cluster,add = TRUE) %>%
select(Index) %>% group_split()
得到out_list
后,使用下面的函数可以达到想要的输出。
函数:
find_match <- function(output,data) {
x <- unique(lapply(1:length(output),function(k) {
sort(t(output[[k]][3]))
}))
return(lapply(1:length(x),function(i) na.omit(unique(data[,x[[i]]]))[1,]))
}
如果我们调用 find_match
,
find_match(out_list1,M1)
[[1]]
[1] 2001 2004 2014
[[2]]
[1] 2001 2010 2014
[[3]]
[1] 2001 2012 2014
另一个试验: 复制您的输入findBestIntersect(M1)
out_list2 <- my_result %>% filter(Years>=3) %>%
filter(Matching>=5) %>%
group_by(Iter,Cluster,add = TRUE) %>%
select(Index) %>% group_split()
find_match(out_list2,M1)
[[1]]
[1] 2001 2010 2014
[[2]]
[1] 2001 2004 2010 2014
[[3]]
[1] 2001 2004 2014
[[4]]
[1] 2001 2004 2012 2014
[[5]]
[1] 2001 2007 2014
[[6]]
[1] 2001 2012 2014
[[7]]
[1] 2007 2012 2014
[[8]]
[1] 2001 2004 2010
[[9]]
[1] 2001 2011 2014
[[10]]
[1] 2001 2010 2012 2014
不同的是,你的输出还给出了
2001 2005 2014
由于 kmodes
是一个分类器,我们不能指望完美拟合。所以这是所有预测算法都预期的错误分类错误。但有趣的是,你的功能也错过了,
2001 2004 2010
这也是一种解决方法。
我编写了一个 coded_best_intersect
函数,该函数依赖于在 code_maker
函数中动态创建 for 循环。它在 30 秒内评估 M3
。因为代码生成了一个列表,所以我依赖于 data.table
for rbindlist
和 print 方法。
library(data.table)
code_maker
函数:
code_maker <- function(non_NA_M, n, k, min.col) {
## initializing for results
res <- list()
z <- 1
## initializing naming
col_names <- colnames(non_NA_M)
i_s <- paste0('i', seq_len(k))
## create the foor loop text. It looks like this mostly
## for (i1 in 1:(n - k + 1)) { for (i2 in (i1 + 1):(n-k+2)) {}}
for_loop <- paste0('for (', i_s, ' in ', c('1:', paste0('(', i_s[-k], ' + 1):')),
n - k + seq_len(k), ')', ' {\n non_na_sums', seq_len(k),
'=non_NA_M[', i_s, ', ] ',
c('', paste0('& ', rep('non_na_sums', k - 1), seq_len(k)[-k])), '',
'\n if (sum(non_na_sums', seq_len(k), ') < ', min.col, ') {next} ',
collapse='\n')
## create the assignment back to the results which looks like
## res[[z]] <- data.table(M=k, N=sum(non_na_sumsk), ROWS=list(c(i1, i2, ..., ik)),
## YEARS=list(col_names[non_na_sumsk]))
inner_text <- paste0('\nres[[z]] <- data.table(M=k, N=sum(non_na_sums',
k, '), ROWS=list(c( ', paste0(i_s, collapse=', '),
')), YEARS=list(col_names[non_na_sums', k , ']))\nz <- z + 1')
## combines the loop parts and closes the for with }}}
for_loop <- paste(for_loop,
inner_text,
paste0(rep('}', k), collapse=''))
## evaluate - the evaluation will assign back to res[[i]]
eval(parse(text=for_loop))
res <- rbindlist(res)
if (length(res) == 0) { #to return emtpy data.table with the correct fields
return(data.table(M=integer(), N=integer(), ROWS=list(), YEARS=list()))
}
res$M <- k
return(res)
}
coded_best_intersect
函数:
coded_best_intersect <- function(M, min.row=5, min.col=3) {
colnames(M) <- apply(M, 2, function(x) na.omit(x)[1])
n_row <- nrow(M)
non_NA <- !is.na(M)
n_combos <- min.row:(n_row - 1)
res2 <- list()
for (i in seq_along(n_combos)) {
res2[[i]] <- code_maker(non_NA, n=n_row, k=n_combos[i], min.col)
if (nrow(res2[[i]]) == 0) {
break
}
}
return(res2)
}
例如k=5
:
# for (i1 in 1:5) {
# non_na_sums1=non_NA_M[i1, ]
# if (sum(non_na_sums1) < 3) {next}
# for (i2 in (i1 + 1):6) {
# non_na_sums2=non_NA_M[i2, ] & non_na_sums1
# if (sum(non_na_sums2) < 3) {next}
# for (i3 in (i2 + 1):7) {
# non_na_sums3=non_NA_M[i3, ] & non_na_sums2
# if (sum(non_na_sums3) < 3) {next}
# for (i4 in (i3 + 1):8) {
# non_na_sums4=non_NA_M[i4, ] & non_na_sums3
# if (sum(non_na_sums4) < 3) {next}
# for (i5 in (i4 + 1):9) {
# non_na_sums5=non_NA_M[i5, ] & non_na_sums4
# if (sum(non_na_sums5) < 3) {next}
# for (i6 in (i5 + 1):10) {
# non_na_sums6=non_NA_M[i6, ] & non_na_sums5
# if (sum(non_na_sums6) < 3) {next}
# res[[z]] <- data.table(M=k, N=sum(non_na_sums6),
# ROWS=list(c( i1, i2, i3, i4, i5, i6)),
# YEARS=list(col_names[non_na_sums6]))
# z <- z + 1 }}}}}}
您可能会注意到 {next}
,这是一种在无法获得至少 3 列的情况下跳过组合的方法。虽然看起来都是硬编码的,但代码实际上是生成、解析然后求值的字符串。
用法和性能
矩阵M1
:
system.time(final1 <- coded_best_intersect(M1))
user system elapsed
0 0 0
data.table::rbindlist(final1)[order(-M*N)]
M N ROWS YEARS
1: 5 3 2, 4, 8, 9,10 2002,2004,2010
矩阵M2
:
system.time(final2 <- coded_best_intersect(M2))
user system elapsed
0.08 0.00 0.08
data.table::rbindlist(final2)[order(-M*N)]
M N ROWS YEARS
1: 7 3 6, 8,11,12,13,16,... 2002,2012,2013
2: 5 4 6, 8,13,16,17 2002,2012,2013,2015
3: 5 4 8,11,12,13,17 2002,2012,2013,2014
4: 6 3 1, 4, 8,13,17,20 2002,2014,2015
5: 6 3 2, 5, 6,10,14,17 2003,2006,2008
---
126: 5 3 10,12,13,17,20 2002,2008,2014
127: 5 3 10,12,14,17,20 2003,2008,2014
128: 5 3 11,12,13,16,17 2002,2012,2013
129: 5 3 11,12,13,17,20 2002,2012,2014
130: 5 3 12,13,15,16,19 2001,2002,2013
矩阵M3
:
system.time(final3 <- coded_best_intersect(M3))
user system elapsed
29.37 0.05 29.54
data.table::rbindlist(final3)[order(-M*N)]
M N ROWS YEARS
1: 6 7 1, 3, 8,15,20,29 1969,1973,1980,1984,1985,1992,...
2: 5 8 1, 3, 8,14,29 1969,1973,1976,1980,1984,1987,...
3: 5 8 1, 3, 8,20,29 1969,1973,1980,1984,1985,1992,...
4: 5 8 2, 7, 9,13,17 1974,1993,1994,2004,2012,2013,...
5: 5 8 3, 6, 8, 9,27 1974,1980,1984,1987,1995,1998,...
---
52374: 5 3 23,24,25,30,31 1979,1997,2002
52375: 5 3 23,25,28,30,31 1979,1992,2002
52376: 5 3 24,25,26,30,31 1983,1997,2002
52377: 5 3 24,25,28,30,31 1979,1983,2002
52378: 5 3 24,26,28,30,31 1983,1986,2002
要将结果的选定部分放入字符串中,您可以执行例如以下:
x <- data.table::rbindlist(final3)[order(-M*N)]
el(x$YEARS[1]) # select `YEARS` of result-row `1:`
# [1] "1969" "1973" "1980" "1984" "1985" "1992" "2003"
注意:查看其他两种截然不同的方法的编辑历史。第一个是 melt
和 join 技术,它会破坏内存。第二种方法是使用 RcppAlgos::comboGeneral
来计算函数。
这是一个使用混合整数规划的小问题,即使使用像 glpk
这样的弱开源求解器也可以很快解决。我正在使用 ompr
包进行数学建模(有关 ompr 的更多信息),并将模型逻辑作为注释包含在代码中。请注意,由于我猜 R 版本不同,我的随机数据与 OP 的不同。
当模型设置为最多 15 年的数据最大化时,M3
的总 运行 时间约为一分钟(即实际求解时间甚至更少)。此方法可以轻松扩展到更大的实例。
library(dplyr)
library(ROI)
library(ROI.plugin.glpk)
library(ompr)
library(ompr.roi)
set.seed(42)
tf <- matrix(sample(c(TRUE, FALSE), 1488, replace=TRUE), 31)
M3 <- t(replicate(31, 1969:2016, simplify=TRUE))
M3[tf] <- NA
m <- +!is.na(M3) # gets logical matrix; 0 if NA else 1
nr <- nrow(m)
nc <- ncol(m)
n_years <- 15
model <- MIPModel() %>%
# keep[i,j] is 1 if matrix cell [i,j] is to be kept else 0
add_variable(keep[i,j], i = 1:nr, j = 1:nc, typ = "binary") %>%
# rm_row[i] is 1 if row i is selected for removal else 0
add_variable(rm_row[i], i = 1:nr, type = "binary") %>%
# rm_col[j] is 1 if column j is selected for removal else 0
add_variable(rm_col[j], j = 1:nc, type = "binary") %>%
# maximize good cells kept
set_objective(sum_expr(keep[i,j], i = 1:nr, j = 1:nc), "max") %>%
# cell can be kept only when row is not selected for removal
add_constraint(sum_expr(keep[i,j], j = 1:nc) <= 1 - rm_row[i], i = 1:nr) %>%
# cell can be kept only when column is not selected for removal
add_constraint(sum_expr(keep[i,j], i = 1:nr) <= 1 - rm_col[j], j = 1:nc) %>%
# only non-NA values can be kept
add_constraint(m[i,j] + rm_row[i] + rm_col[j] >= 1, i = 1:nr, j = 1:nc) %>%
# keep at most n_years columns i.e. remove at least (nc - n_years) columns
# I used >= instead of == to avoid infeasiblity
add_constraint(sum_expr(rm_col[j], j = 1:nc) >= nc - n_years) %>%
# solve using free glpk solver
solve_model(with_ROI(solver = "glpk"))
结果-
solver_status(model)
# [1] "optimal" <- indicates guaranteed optimum (at least one of the many possible)
# get rows to remove
rm_rows <- model %>%
get_solution(rm_row[i]) %>%
filter(value > 0) %>% pull(i) %>% print()
# [1] 1 2 3 4 6 8 9 11 12 13 14 15 17 18 19 20 21 22 23 25 27 28 29 30 31
# get columns to remove
rm_cols <- model %>%
get_solution(rm_col[j]) %>%
filter(value > 0) %>% pull(j) %>% print()
# [1] 2 3 4 5 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
# [24] 27 28 29 30 31 32 33 34 35 36 38 39 40 41 44 45 46 47 48
result <- M3[-rm_rows, -rm_cols, drop = F]
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1969 1974 1994 2005 2010 2011
[2,] 1969 1974 1994 2005 2010 2011
[3,] 1969 1974 1994 2005 2010 2011
[4,] 1969 1974 1994 2005 2010 2011
[5,] 1969 1974 1994 2005 2010 2011
[6,] 1969 1974 1994 2005 2010 2011