Find the first N elements shared between two vectors of correlation scores
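I have two data tables, train and target, with samples in rows and chemicals in columns; the table values are the relative abundance of each chemical in each sample. The chemicals are the same in both datasets. I have computed the absolute values of the Spearman correlations between the values in the train data and in the target data, and now I want to find the smallest i such that the first i elements of both arrays contain n elements in common.
Example: suppose we are looking at chemical Y1, and the correlation values of Y1 with chemicals Y1 through Y10 in train and target are: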
train
Y1 Y2 Y3 Y4 Y5 Y6 Y7 Y8 Y9 Y10
Y1: 1 -1 -.2 .5 -.9 .7 .1 .1 -.2 -.5
target
Y1 Y2 Y3 Y4 Y5 Y6 Y7 Y8 Y9 Y10
Y1: 1 .1 .2 -.7 .6 .4 .2 .5 -.5 -.2
The rank order of the absolute values for each is:
train
Y1: Y1 Y2 Y5 Y6 Y4 Y10 Y9 Y3 Y7 Y8
target
Y1: Y1 Y4 Y5 Y8 Y9 Y6 Y3 Y7 Y10 Y2
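For reference, the rank orders above can be reproduced from the correlation values with a short sketch like the following. The names `train_cor`, `target_cor` and `rank_by_abs` are made up for illustration and are not from the original post. Ties are broken by original column order here, so tied entries (for example Y3 and Y9 in train, both 0.2 in absolute value) may come out in a different order than in the listing.

```r
# Correlation values of Y1 with Y1..Y10, copied from the example above
train_cor  <- c(Y1 = 1, Y2 = -1, Y3 = -.2, Y4 = .5, Y5 = -.9,
                Y6 = .7, Y7 = .1, Y8 = .1, Y9 = -.2, Y10 = -.5)
target_cor <- c(Y1 = 1, Y2 = .1, Y3 = .2, Y4 = -.7, Y5 = .6,
                Y6 = .4, Y7 = .2, Y8 = .5, Y9 = -.5, Y10 = -.2)

# Hypothetical helper: chemical names sorted by decreasing |correlation|.
# order() is stable, so ties keep their original column order.
rank_by_abs <- function(x) names(x)[order(-abs(x))]

train_order  <- rank_by_abs(train_cor)
target_order <- rank_by_abs(target_cor)
```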
Then the first 5 elements shared between train and target are:
Y1: Y1, Y5, Y4, Y6, Y9
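So for n = 5, the first 7 elements of both arrays have Y1, Y5, Y4, Y6 and Y9 in common. An algorithm comparing them has to go out to the 7th element to find 5 elements present in both lists. In the worst case, it has to go out to the 10th element.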
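The worked example can be checked directly by counting, for each prefix length i, how many elements the two prefixes share. This is only a brute-force sketch to illustrate the definition of "smallest i", not an efficient solution; the variable names `counts` and `smallest_i` are illustrative.

```r
# Rank orders from the example above
train  <- c("Y1","Y2","Y5","Y6","Y4","Y10","Y9","Y3","Y7","Y8")
target <- c("Y1","Y4","Y5","Y8","Y9","Y6","Y3","Y7","Y10","Y2")
n <- 5

# Number of elements shared by the first i entries of both lists, i = 1..m
counts <- sapply(seq_along(train), function(i) {
  length(intersect(train[1:i], target[1:i]))
})

# Smallest prefix length whose overlap reaches n.
# ">=" rather than "==" because the count can jump by 2 in one step.
smallest_i <- which(counts >= n)[1]
smallest_i
# 7
```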
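Here is what I have tried:
Sort the lists of absolute correlations for each chemical in train and target, take the intersection of the two lists, and take the first N elements of the result. This fails because the chemicals are the same in train and target, so the intersection is just the whole list of chemicals, and the order of the result depends only on whether train or target is the first argument to intersect().
Going one chemical at a time, take the intersection of the top N correlation scores of train and target and check whether the length of the intersection is less than N. If it is, take the intersection of the top N+1 scores, and repeat until the intersection is N long. This is accurate, but assuming set intersection is O(n), it is an O(n^2) algorithm. I would like to do better. Current R code follows: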
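common = c()
num = 10
i = num
# Grow the prefix until the two prefixes share at least num elements
while (length(common) < num) {
  # Positions 2..(i+1): the first entry of each list is skipped
  common = intersect(corr_train[2:(i+1)], corr_target[2:(i+1)])
  i = i + 1
}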
Any ideas?
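For 2 sets 'train' and 'target' of equal size m, this algorithm has a complexity of O(log(m)+m).
Basically, compute a score equal to the rank (position) of each element in each ordered set, and compare the scores of corresponding elements. The idea is to add an element to the common list at a given position only if its position in the other list is not greater, that is, only once the element has appeared in both lists.
When different elements of the 2 sets are reached at the same i (for example with n = 7, where either Y7 or Y10 could be chosen), the 'train' set is arbitrarily favored.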
#Calculate the rank of each element's name in each set
trainrank <- rank(train)
targetrank <- rank(target)
#For each element (taken in sorted name order),
# get its position in train and in target
trainscores <- order(trainrank)
targetscores <- order(targetrank)
#Include a train element if its position in train is
# greater than or equal to its position in target
includetrain <- trainscores >= targetscores
includetrain <- includetrain[trainrank]
#Include a target element if its position in target is
# strictly greater than its position in train
includetarget <- targetscores > trainscores
includetarget <- includetarget[targetrank]
#To get a set containing n common elements
# from 2 sets of equal size m,
# this code will take 2*m operations at most
commonset <- c()
m <- length(train)
n <- 5
i <- 1
while (length(commonset) < n) {
  newelement <- NA
  while (i <= m && is.na(newelement)) {
    #If both the train and the target element at position i
    # qualify, the train element is taken first
    if (includetrain[i]) {
      newelement <- train[i]
      includetrain[i] <- FALSE
    } else if (includetarget[i]) {
      newelement <- target[i]
      includetarget[i] <- FALSE
    } else {
      i <- i + 1 #Next position if both are FALSE
    }
  }
  commonset <- c(commonset, newelement)
}
commonset #Common set of n elements
# "Y1" "Y5" "Y4" "Y6" "Y9"
print(i) #First i elements used to build the common set
# 7
Raw data
#Train and target data sets
train <- c("Y1","Y2","Y5","Y6","Y4","Y10","Y9","Y3","Y7","Y8")
target <- c("Y1","Y4","Y5","Y8","Y9","Y6","Y3","Y7","Y10","Y2")
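As a possible alternative, the same idea (an element becomes "common" at the later of its two positions) can be sketched in vectorized form with match() and pmax(). This is a sketch under the assumption that both vectors contain exactly the same elements with no duplicates; the names `pos`, `i` and `common` are illustrative. Ties are resolved in train order, which can differ from the loop above when two elements are reached at the same i.

```r
train  <- c("Y1","Y2","Y5","Y6","Y4","Y10","Y9","Y3","Y7","Y8")
target <- c("Y1","Y4","Y5","Y8","Y9","Y6","Y3","Y7","Y10","Y2")
n <- 5

# For each element of train: the later of its positions in train and target,
# i.e. the prefix length at which it is present in both lists
pos <- pmax(seq_along(train), match(train, target))

# Smallest i such that the first i elements of both lists share n elements
i <- sort(pos)[n]

# The first n shared elements, in the order they become shared
common <- train[order(pos)][seq_len(n)]

i
# 7
common
# "Y1" "Y5" "Y4" "Y6" "Y9"
```

The sort dominates the cost here, so this sketch runs in O(m log m) for lists of length m.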