可以使用 AWK 数组来获得相关矩阵中的最大簇吗?
Can AWK array be used to get largest cluster in correlation matrix?
我有一个描述项目 A-K 之间相关性的矩阵,其中 1=相关,0=不相关。
有没有一种简单的方法可以从数据中提取最大的簇?换句话说,具有最相关元素的集群。以下是一些示例数据:
# A B C D E F G H I J K
A 1 1 1 1 1 1 1 1 1 1 1
B 1 1 1 1 1 1 1 1 1 1 1
C 1 1 1 1 1 1 1 1 1 1 1
D 1 1 1 1 0 1 0 0 0 1 1
E 1 1 1 0 1 0 0 1 1 0 0
F 1 1 1 1 0 1 0 0 1 0 1
G 1 1 1 0 0 0 1 0 0 0 0
H 1 1 1 0 1 0 0 1 1 1 1
I 1 1 1 0 1 1 0 1 1 0 0
J 1 1 1 1 0 0 0 1 0 1 0
K 1 1 1 1 0 1 0 1 0 0 1
用眼睛交换几个 columns/rows,预期的结果将是矩阵的左上角,这是一个大小为 6 的簇,其中包含:{A, B, C, D, F, K }
我知道 awk 对于这个应用程序来说不是最用户友好的,但我热衷于使用 awk,因为它将集成到更大的 awk 脚本中。话虽这么说,我对这门语言并不是完全不动心。
不确定从哪里开始,但这是我在 python 中的想法的更复杂版本:
https://stats.stackexchange.com/questions/138325/clustering-a-correlation-matrix
假设:
- 所有矩阵都是
symmetric
(即平方;等于它的转置;矩阵[x,y]=矩阵[y,x])
matrix[x,x]=1
所有 x
- 所有矩阵条目都是
0
或1
- 对 1 元素簇不感兴趣
- 对同一簇的排列不感兴趣(即,
A,B
与 B,A
相同)
- 因为我们不必担心排列,我们可以专注于按照元素在矩阵中出现的顺序处理元素(例如,我们处理
A,B,C
并忽略 [=22 的等价物=]、B,A,C
、B,C,A
、C,A,B
和 C,B,A
);这使我们能够专注于处理矩阵的 top/right 一半(在 identity/diagonal 之上)并按从左到右的顺序处理;这将大大减少我们需要评估的排列数量
- 正如问题中所证明的那样,组成一个簇的元素可以在矩阵中移动 up/left 以便用
1's
填充矩阵的 top/left (这来自在处理过程中发挥作用,对于每个新元素,我们只需要测试添加到矩阵的此 top/left 部分的新 column/row 的等价物)
关于最后一个假设...假设我们有集群 A,D
并且我们现在要测试 A,D,F
;我们只需要测试新的 column/row 条目 (?
):
Current Cluster New Cluster ?
A D A D F
A 1 1 A 1 1 ? # if matrix is symmetric then only need to test
D 1 1 D 1 1 ? # the new column *OR* the new row, not both;
F ? ? 1 # bottom/right == 1 == matrix[F,F] per earlier assumption
一个想法使用递归函数和两个 GNU awk's
特征:a) 数组数组(也称为多维数组)和 b ) PROCINFO["sorted_in"]
用于将簇自定义排序到标准输出:
awk '
######
# load matrix into memory
FNR==1 { n=(NF-1) # number of elements to be processed
for (i=2;i<=NF;i++)
label[i-1]=$i # save labels
next
}
{ for (i=2;i<=NF;i++)
m[FNR-1][i-1]=$i # populate matrix array m[row#][column#]
}
######
# define our recursive function
function find_cluster(cluster, i, clstrcount, stackseq, j, k, corrcount) {
# cluster : current working cluster (eg, "A,B,C")
# i : index of latest element (eg, for "A,B,C" => latest element is "C" so i = 3
# clstrcount : number of elements in current cluster
# stackseq : sequence number of stack[] array
# : stack[] contains list of indexes for current cluster (for "A,B,C" stack = "1,2,3")
# j,k,corrcount : declaring additional variables as "local" to this invocation of the function
clstrcount++ # number of elements to be processed at this call/level
for (j=i+1;j<=n;j++) { # process all elements/indexes greater than i
corrcount=1 # reset correlation count; always start with 1 since m[j][j]=1
# check the new column/row added to the top/left of the matrix to see if it extends the current cluster (ie, all entries are "1")
for (k in stack) { # loop through element/indexes in stack
if (m[stack[k]][j]) # check column entries
corrcount++
if (m[j][stack[k]]) # check row entries; not necessary if matrix is symmetric but we will add it here to show the m[][] references
corrcount++
}
if (corrcount == (stackseq*2 +1) ) { # if we have all "1"s we have a new cluster of size clstrcount
stack[++stackseq]=j # "push" current element/index on stack; increment stack seq/index
cluster=cluster "," label[j] # add current element/label to cluster
max= (clstrcount>max) ? clstrcount : max # new max(cluster count) ?
clusters[clstrcount][++clsterseq]=cluster # add new cluster to our master list: clusters[cluster_count][seq]
find_cluster(cluster, j, clstrcount, stackseq) # recursive call to check for next element(s)
delete stack[stackseq--] # back from recursive call so "pop" curent element (j) from stack
gsub(/[,][^,]+$/,"",cluster) # remove current element/label from cluster to make way for next element/label to be tested
}
}
}
######
# start looking for clusters of size 2+
END { max=2 # not interested in clusters of "1"
for (i=1;i<n;i++) { # loop through list of elements
clstrcount=1 # init cluster count = 1
clstrseq=0 # init clusters[...][seq] sequence seed
cluster=label[i] # reset cluster to current element/label
stackseq=1 # reset stack[seq] sequence seed
stack[stackseq]=i # "push" current element on stack
find_cluster(cluster, i, clstrcount, stackseq) # start recursive calls looking for next element in cluster
}
######
# for now just display clusters with size > 2; adjust this next line to add/remove cluster sizes from stdout
if (max>2) # print list of clusters with length > 2
for (i=max;i>2;i--) { # print from largest to smallest and ...
PROCINFO["sorted_in"]="@val_str_asc" # in alphabetical order
printf "####### clusters of size %s:\n", i
for (j in clusters[i]) # loop through all entries for clusters of size "i"
print clusters[i][j]
}
}
' matrix.dat
注意: 当前版本(无可否认)有点冗长……这是我在研究细节时记下第一遍解决方案的结果;通过进一步分析,可以减少代码;话虽如此,在这个 11 元素矩阵中找到所有 2+ 大小的簇所花费的时间还算不错:
real 0m0.084s
user 0m0.031s
sys 0m0.046s
这会生成:
####### clusters of size 6:
A,B,C,D,F,K
A,B,C,E,H,I
####### clusters of size 5:
A,B,C,D,F
A,B,C,D,J
A,B,C,D,K
A,B,C,E,H
A,B,C,E,I
A,B,C,F,I
A,B,C,F,K
A,B,C,H,I
A,B,C,H,J
A,B,C,H,K
A,B,D,F,K
A,B,E,H,I
A,C,D,F,K
A,C,E,H,I
B,C,D,F,K
B,C,E,H,I
####### clusters of size 4:
A,B,C,D
A,B,C,E
A,B,C,F
A,B,C,G
A,B,C,H
A,B,C,I
A,B,C,J
A,B,C,K
A,B,D,F
A,B,D,J
A,B,D,K
A,B,E,H
A,B,E,I
A,B,F,I
A,B,F,K
A,B,H,I
A,B,H,J
A,B,H,K
A,C,D,F
A,C,D,J
A,C,D,K
A,C,E,H
A,C,E,I
A,C,F,I
A,C,F,K
A,C,H,I
A,C,H,J
A,C,H,K
A,D,F,K
A,E,H,I
B,C,D,F
B,C,D,J
B,C,D,K
B,C,E,H
B,C,E,I
B,C,F,I
B,C,F,K
B,C,H,I
B,C,H,J
B,C,H,K
B,D,F,K
B,E,H,I
C,D,F,K
C,E,H,I
####### clusters of size 3:
A,B,C
A,B,D
A,B,E
A,B,F
A,B,G
A,B,H
A,B,I
A,B,J
A,B,K
A,C,D
A,C,E
A,C,F
A,C,G
A,C,H
A,C,I
A,C,J
A,C,K
A,D,F
A,D,J
A,D,K
A,E,H
A,E,I
A,F,I
A,F,K
A,H,I
A,H,J
A,H,K
B,C,D
B,C,E
B,C,F
B,C,G
B,C,H
B,C,I
B,C,J
B,C,K
B,D,F
B,D,J
B,D,K
B,E,H
B,E,I
B,F,I
B,F,K
B,H,I
B,H,J
B,H,K
C,D,F
C,D,J
C,D,K
C,E,H
C,E,I
C,F,I
C,F,K
C,H,I
C,H,J
C,H,K
D,F,K
E,H,I
我有一个描述项目 A-K 之间相关性的矩阵,其中 1=相关,0=不相关。
有没有一种简单的方法可以从数据中提取最大的簇?换句话说,具有最相关元素的集群。以下是一些示例数据:
# A B C D E F G H I J K
A 1 1 1 1 1 1 1 1 1 1 1
B 1 1 1 1 1 1 1 1 1 1 1
C 1 1 1 1 1 1 1 1 1 1 1
D 1 1 1 1 0 1 0 0 0 1 1
E 1 1 1 0 1 0 0 1 1 0 0
F 1 1 1 1 0 1 0 0 1 0 1
G 1 1 1 0 0 0 1 0 0 0 0
H 1 1 1 0 1 0 0 1 1 1 1
I 1 1 1 0 1 1 0 1 1 0 0
J 1 1 1 1 0 0 0 1 0 1 0
K 1 1 1 1 0 1 0 1 0 0 1
用眼睛交换几个 columns/rows,预期的结果将是矩阵的左上角,这是一个大小为 6 的簇,其中包含:{A, B, C, D, F, K }
我知道 awk 对于这个应用程序来说不是最用户友好的,但我热衷于使用 awk,因为它将集成到更大的 awk 脚本中。话虽这么说,我对这门语言并不是完全不动心。
不确定从哪里开始,但这是我在 python 中的想法的更复杂版本: https://stats.stackexchange.com/questions/138325/clustering-a-correlation-matrix
假设:
- 所有矩阵都是
symmetric
(即平方;等于它的转置;矩阵[x,y]=矩阵[y,x]) matrix[x,x]=1
所有x
- 所有矩阵条目都是
0
或1
- 对 1 元素簇不感兴趣
- 对同一簇的排列不感兴趣(即,
A,B
与B,A
相同) - 因为我们不必担心排列,我们可以专注于按照元素在矩阵中出现的顺序处理元素(例如,我们处理
A,B,C
并忽略 [=22 的等价物=]、B,A,C
、B,C,A
、C,A,B
和C,B,A
);这使我们能够专注于处理矩阵的 top/right 一半(在 identity/diagonal 之上)并按从左到右的顺序处理;这将大大减少我们需要评估的排列数量 - 正如问题中所证明的那样,组成一个簇的元素可以在矩阵中移动 up/left 以便用
1's
填充矩阵的 top/left (这来自在处理过程中发挥作用,对于每个新元素,我们只需要测试添加到矩阵的此 top/left 部分的新 column/row 的等价物)
关于最后一个假设...假设我们有集群 A,D
并且我们现在要测试 A,D,F
;我们只需要测试新的 column/row 条目 (?
):
Current Cluster New Cluster ?
A D A D F
A 1 1 A 1 1 ? # if matrix is symmetric then only need to test
D 1 1 D 1 1 ? # the new column *OR* the new row, not both;
F ? ? 1 # bottom/right == 1 == matrix[F,F] per earlier assumption
一个想法使用递归函数和两个 GNU awk's
特征:a) 数组数组(也称为多维数组)和 b ) PROCINFO["sorted_in"]
用于将簇自定义排序到标准输出:
awk '
######
# load matrix into memory
FNR==1 { n=(NF-1) # number of elements to be processed
for (i=2;i<=NF;i++)
label[i-1]=$i # save labels
next
}
{ for (i=2;i<=NF;i++)
m[FNR-1][i-1]=$i # populate matrix array m[row#][column#]
}
######
# define our recursive function
function find_cluster(cluster, i, clstrcount, stackseq, j, k, corrcount) {
# cluster : current working cluster (eg, "A,B,C")
# i : index of latest element (eg, for "A,B,C" => latest element is "C" so i = 3
# clstrcount : number of elements in current cluster
# stackseq : sequence number of stack[] array
# : stack[] contains list of indexes for current cluster (for "A,B,C" stack = "1,2,3")
# j,k,corrcount : declaring additional variables as "local" to this invocation of the function
clstrcount++ # number of elements to be processed at this call/level
for (j=i+1;j<=n;j++) { # process all elements/indexes greater than i
corrcount=1 # reset correlation count; always start with 1 since m[j][j]=1
# check the new column/row added to the top/left of the matrix to see if it extends the current cluster (ie, all entries are "1")
for (k in stack) { # loop through element/indexes in stack
if (m[stack[k]][j]) # check column entries
corrcount++
if (m[j][stack[k]]) # check row entries; not necessary if matrix is symmetric but we will add it here to show the m[][] references
corrcount++
}
if (corrcount == (stackseq*2 +1) ) { # if we have all "1"s we have a new cluster of size clstrcount
stack[++stackseq]=j # "push" current element/index on stack; increment stack seq/index
cluster=cluster "," label[j] # add current element/label to cluster
max= (clstrcount>max) ? clstrcount : max # new max(cluster count) ?
clusters[clstrcount][++clsterseq]=cluster # add new cluster to our master list: clusters[cluster_count][seq]
find_cluster(cluster, j, clstrcount, stackseq) # recursive call to check for next element(s)
delete stack[stackseq--] # back from recursive call so "pop" curent element (j) from stack
gsub(/[,][^,]+$/,"",cluster) # remove current element/label from cluster to make way for next element/label to be tested
}
}
}
######
# start looking for clusters of size 2+
END { max=2 # not interested in clusters of "1"
for (i=1;i<n;i++) { # loop through list of elements
clstrcount=1 # init cluster count = 1
clstrseq=0 # init clusters[...][seq] sequence seed
cluster=label[i] # reset cluster to current element/label
stackseq=1 # reset stack[seq] sequence seed
stack[stackseq]=i # "push" current element on stack
find_cluster(cluster, i, clstrcount, stackseq) # start recursive calls looking for next element in cluster
}
######
# for now just display clusters with size > 2; adjust this next line to add/remove cluster sizes from stdout
if (max>2) # print list of clusters with length > 2
for (i=max;i>2;i--) { # print from largest to smallest and ...
PROCINFO["sorted_in"]="@val_str_asc" # in alphabetical order
printf "####### clusters of size %s:\n", i
for (j in clusters[i]) # loop through all entries for clusters of size "i"
print clusters[i][j]
}
}
' matrix.dat
注意: 当前版本(无可否认)有点冗长……这是我在研究细节时记下第一遍解决方案的结果;通过进一步分析,可以减少代码;话虽如此,在这个 11 元素矩阵中找到所有 2+ 大小的簇所花费的时间还算不错:
real 0m0.084s
user 0m0.031s
sys 0m0.046s
这会生成:
####### clusters of size 6:
A,B,C,D,F,K
A,B,C,E,H,I
####### clusters of size 5:
A,B,C,D,F
A,B,C,D,J
A,B,C,D,K
A,B,C,E,H
A,B,C,E,I
A,B,C,F,I
A,B,C,F,K
A,B,C,H,I
A,B,C,H,J
A,B,C,H,K
A,B,D,F,K
A,B,E,H,I
A,C,D,F,K
A,C,E,H,I
B,C,D,F,K
B,C,E,H,I
####### clusters of size 4:
A,B,C,D
A,B,C,E
A,B,C,F
A,B,C,G
A,B,C,H
A,B,C,I
A,B,C,J
A,B,C,K
A,B,D,F
A,B,D,J
A,B,D,K
A,B,E,H
A,B,E,I
A,B,F,I
A,B,F,K
A,B,H,I
A,B,H,J
A,B,H,K
A,C,D,F
A,C,D,J
A,C,D,K
A,C,E,H
A,C,E,I
A,C,F,I
A,C,F,K
A,C,H,I
A,C,H,J
A,C,H,K
A,D,F,K
A,E,H,I
B,C,D,F
B,C,D,J
B,C,D,K
B,C,E,H
B,C,E,I
B,C,F,I
B,C,F,K
B,C,H,I
B,C,H,J
B,C,H,K
B,D,F,K
B,E,H,I
C,D,F,K
C,E,H,I
####### clusters of size 3:
A,B,C
A,B,D
A,B,E
A,B,F
A,B,G
A,B,H
A,B,I
A,B,J
A,B,K
A,C,D
A,C,E
A,C,F
A,C,G
A,C,H
A,C,I
A,C,J
A,C,K
A,D,F
A,D,J
A,D,K
A,E,H
A,E,I
A,F,I
A,F,K
A,H,I
A,H,J
A,H,K
B,C,D
B,C,E
B,C,F
B,C,G
B,C,H
B,C,I
B,C,J
B,C,K
B,D,F
B,D,J
B,D,K
B,E,H
B,E,I
B,F,I
B,F,K
B,H,I
B,H,J
B,H,K
C,D,F
C,D,J
C,D,K
C,E,H
C,E,I
C,F,I
C,F,K
C,H,I
C,H,J
C,H,K
D,F,K
E,H,I