可以使用 AWK 数组来获得相关矩阵中的最大簇吗?

Can AWK array be used to get largest cluster in correlation matrix?

我有一个描述项目 A-K 之间相关性的矩阵,其中 1=相关,0=不相关。

有没有一种简单的方法可以从数据中提取最大的簇?换句话说,具有最相关元素的集群。以下是一些示例数据:

#  A B C D E F G H I J K
A  1 1 1 1 1 1 1 1 1 1 1
B  1 1 1 1 1 1 1 1 1 1 1
C  1 1 1 1 1 1 1 1 1 1 1
D  1 1 1 1 0 1 0 0 0 1 1
E  1 1 1 0 1 0 0 1 1 0 0
F  1 1 1 1 0 1 0 0 1 0 1
G  1 1 1 0 0 0 1 0 0 0 0
H  1 1 1 0 1 0 0 1 1 1 1
I  1 1 1 0 1 1 0 1 1 0 0
J  1 1 1 1 0 0 0 1 0 1 0
K  1 1 1 1 0 1 0 1 0 0 1

用眼睛交换几个 columns/rows,预期的结果将是矩阵的左上角,这是一个大小为 6 的簇,其中包含:{A, B, C, D, F, K }

我知道 awk 对于这个应用程序来说不是最用户友好的,但我热衷于使用 awk,因为它将集成到更大的 awk 脚本中。话虽这么说,我对这门语言并不是完全不动心。

不确定从哪里开始,但这是我在 python 中的想法的更复杂版本: https://stats.stackexchange.com/questions/138325/clustering-a-correlation-matrix

假设:

  • 所有矩阵都是symmetric(即平方;等于它的转置;矩阵[x,y]=矩阵[y,x])
  • matrix[x,x]=1 所有 x
  • 所有矩阵条目都是01
  • 对 1 元素簇不感兴趣
  • 对同一簇的排列不感兴趣(即,A,BB,A 相同)
  • 因为我们不必担心排列,我们可以专注于按照元素在矩阵中出现的顺序处理元素(例如,我们处理 A,B,C 并忽略 [=22 的等价物=]、B,A,CB,C,AC,A,BC,B,A);这使我们能够专注于处理矩阵的 top/right 一半(在 identity/diagonal 之上)并按从左到右的顺序处理;这将大大减少我们需要评估的排列数量
  • 正如问题中所证明的那样,组成一个簇的元素可以在矩阵中移动 up/left 以便用 1's 填充矩阵的 top/left (这来自在处理过程中发挥作用,对于每个新元素,我们只需要测试添加到矩阵的此 top/left 部分的新 column/row 的等价物)

关于最后一个假设...假设我们有集群 A,D 并且我们现在要测试 A,D,F;我们只需要测试新的 column/row 条目 (?):

Current Cluster     New Cluster ?

  A D                 A D F
A 1 1               A 1 1 ?     # if matrix is symmetric then only need to test
D 1 1               D 1 1 ?     # the new column *OR* the new row, not both;
                    F ? ? 1     # bottom/right == 1 == matrix[F,F] per earlier assumption

一个想法使用递归函数和两个 GNU awk's 特征:a) 数组数组(也称为多维数组)和 b ) PROCINFO["sorted_in"] 用于将簇自定义排序到标准输出:

awk '

######
# load matrix into memory

FNR==1 { n=(NF-1)                                            # number of elements to be processed
         for (i=2;i<=NF;i++)
             label[i-1]=$i                                   # save labels
         next
       }
       { for (i=2;i<=NF;i++)
             m[FNR-1][i-1]=$i                                # populate matrix array m[row#][column#]
       }

######
# define our recursive function

function find_cluster(cluster, i, clstrcount, stackseq,    j, k, corrcount) {

    # cluster       : current working cluster (eg, "A,B,C")
    # i             : index of latest element (eg, for "A,B,C" => latest element is "C" so i = 3
    # clstrcount    : number of elements in current cluster
    # stackseq      : sequence number of stack[] array
    #               : stack[] contains list of indexes for current cluster (for "A,B,C" stack = "1,2,3")
    # j,k,corrcount : declaring additional variables as "local" to this invocation of the function

    clstrcount++                                             # number of elements to be processed at this call/level

    for (j=i+1;j<=n;j++) {                                   # process all elements/indexes greater than i

        corrcount=1                                          # reset correlation count; always start with 1 since m[j][j]=1

        # check the new column/row added to the top/left of the matrix to see if it extends the current cluster (ie, all entries are "1")

        for (k in stack) {                                   # loop through element/indexes in stack
            if (m[stack[k]][j])                              # check column entries
               corrcount++
            if (m[j][stack[k]])                              # check row entries; not necessary if matrix is symmetric but we will add it here to show the m[][] references
               corrcount++
        }

        if (corrcount == (stackseq*2 +1) ) {                 # if we have all "1"s we have a new cluster of size clstrcount

           stack[++stackseq]=j                               # "push" current element/index on stack; increment stack seq/index
           cluster=cluster "," label[j]                      # add current element/label to cluster

           max= (clstrcount>max) ? clstrcount : max          # new max(cluster count) ?
           clusters[clstrcount][++clsterseq]=cluster         # add new cluster to our master list: clusters[cluster_count][seq]

           find_cluster(cluster, j, clstrcount, stackseq)    # recursive call to check for next element(s)

           delete stack[stackseq--]                          # back from recursive call so "pop" curent element (j) from stack
           gsub(/[,][^,]+$/,"",cluster)                      # remove current element/label from cluster to make way for next element/label to be tested
        }
    }
}

######
# start looking for clusters of size 2+

END    { max=2                                               # not interested in clusters of "1"
         for (i=1;i<n;i++) {                                 # loop through list of elements
             clstrcount=1                                    # init cluster count = 1
             clstrseq=0                                      # init clusters[...][seq] sequence seed
             cluster=label[i]                                # reset cluster to current element/label
             stackseq=1                                      # reset stack[seq] sequence seed
             stack[stackseq]=i                               # "push" current element on stack

             find_cluster(cluster, i, clstrcount, stackseq)  # start recursive calls looking for next element in cluster
         }

######
# for now just display clusters with size > 2; adjust this next line to add/remove cluster sizes from stdout

         if (max>2)                                          # print list of clusters with length > 2
            for (i=max;i>2;i--) {                            # print from largest to smallest and ...
                PROCINFO["sorted_in"]="@val_str_asc"         # in alphabetical order
                printf "####### clusters of size %s:\n", i
                for (j in clusters[i])                       # loop through all entries for clusters of size "i"
                    print clusters[i][j]
            }
       }
' matrix.dat

注意: 当前版本(无可否认)有点冗长……这是我在研究细节时记下第一遍解决方案的结果;通过进一步分析,可以减少代码;话虽如此,在这个 11 元素矩阵中找到所有 2+ 大小的簇所花费的时间还算不错:

real    0m0.084s
user    0m0.031s
sys     0m0.046s

这会生成:

####### clusters of size 6:
A,B,C,D,F,K
A,B,C,E,H,I
####### clusters of size 5:
A,B,C,D,F
A,B,C,D,J
A,B,C,D,K
A,B,C,E,H
A,B,C,E,I
A,B,C,F,I
A,B,C,F,K
A,B,C,H,I
A,B,C,H,J
A,B,C,H,K
A,B,D,F,K
A,B,E,H,I
A,C,D,F,K
A,C,E,H,I
B,C,D,F,K
B,C,E,H,I
####### clusters of size 4:
A,B,C,D
A,B,C,E
A,B,C,F
A,B,C,G
A,B,C,H
A,B,C,I
A,B,C,J
A,B,C,K
A,B,D,F
A,B,D,J
A,B,D,K
A,B,E,H
A,B,E,I
A,B,F,I
A,B,F,K
A,B,H,I
A,B,H,J
A,B,H,K
A,C,D,F
A,C,D,J
A,C,D,K
A,C,E,H
A,C,E,I
A,C,F,I
A,C,F,K
A,C,H,I
A,C,H,J
A,C,H,K
A,D,F,K
A,E,H,I
B,C,D,F
B,C,D,J
B,C,D,K
B,C,E,H
B,C,E,I
B,C,F,I
B,C,F,K
B,C,H,I
B,C,H,J
B,C,H,K
B,D,F,K
B,E,H,I
C,D,F,K
C,E,H,I
####### clusters of size 3:
A,B,C
A,B,D
A,B,E
A,B,F
A,B,G
A,B,H
A,B,I
A,B,J
A,B,K
A,C,D
A,C,E
A,C,F
A,C,G
A,C,H
A,C,I
A,C,J
A,C,K
A,D,F
A,D,J
A,D,K
A,E,H
A,E,I
A,F,I
A,F,K
A,H,I
A,H,J
A,H,K
B,C,D
B,C,E
B,C,F
B,C,G
B,C,H
B,C,I
B,C,J
B,C,K
B,D,F
B,D,J
B,D,K
B,E,H
B,E,I
B,F,I
B,F,K
B,H,I
B,H,J
B,H,K
C,D,F
C,D,J
C,D,K
C,E,H
C,E,I
C,F,I
C,F,K
C,H,I
C,H,J
C,H,K
D,F,K
E,H,I