Splitting columns and counting occurrences of specific characters per line

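I am trying to count the total number of occurrences of each of C, T, A and G per line and append the counts to the end of each line. My input looks like this: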

NC_044998.1  3749  0  GG  0  GG  0  GG  0  GG  1  GC  0  GG  0  GG  0  GG  0  GG  0  GG  0  GG 
NC_044998.1  3755  1  TA  0  TT  0  TT  1  TA  1  TA  1  TA  1  TA  0  TT  1  TA  0  TT  1  TA
NC_044998.1  4012  0  TT  1  TA  1  TA  0  TT  0  TT  0  TT  0  TT  1  TA  0  TT  1  TA  0  TT
NC_044998.1  5298  1  GA  0  GG  0  GG  0  GG  0  GG  0  GG  0  GG  0  GG  1  GA  0  GG  0  GG

The desired output looks like this:

NC_044998.1  3749  0  GG  0  GG  0  GG  0  GG  1  GC  0  GG  0  GG  0  GG  0  GG  0  GG  0  GG 1 0 0 21
NC_044998.1  3755  1  TA  0  TT  0  TT  1  TA  1  TA  1  TA  1  TA  0  TT  1  TA  0  TT  1  TA 0 15 7 0
NC_044998.1  4012  0  TT  1  TA  1  TA  0  TT  0  TT  0  TT  0  TT  1  TA  0  TT  1  TA  0  TT 0 18 4 0
NC_044998.1  5298  1  GA  0  GG  0  GG  0  GG  0  GG  0  GG  0  GG  0  GG  1  GA  0  GG  0  GG 0 0 2 20

I tried modifying this, but it only prints 0 0 0 0 for every line:

BEGIN {
    numTags = split("C T A G",tags)
}
{
    s = 0
    for (i=4; i<=24; i+=2) {
        for (j=1; j<=2; ++j)
        tag = substr($i,j,1)
        s+=tag
    }
    printf "%s", $0
    for (tagNr=1; tagNr<=numTags; tagNr++) {
        s = tags[tagNr]
        printf "%s%d", OFS, s
    }
    print ""
}

$ awk '{ o=$0; $1=""; print o, gsub(/C/,""), gsub(/T/,""), gsub(/A/,""), gsub(/G/,"") }' file
NC_044998.1  3749  0  GG  0  GG  0  GG  0  GG  1  GC  0  GG  0  GG  0  GG  0  GG  0  GG  0  GG 1 0 0 21
NC_044998.1  3755  1  TA  0  TT  0  TT  1  TA  1  TA  1  TA  1  TA  0  TT  1  TA  0  TT  1  TA 0 15 7 0
NC_044998.1  4012  0  TT  1  TA  1  TA  0  TT  0  TT  0  TT  0  TT  1  TA  0  TT  1  TA  0  TT 0 18 4 0
NC_044998.1  5298  1  GA  0  GG  0  GG  0  GG  0  GG  0  GG  0  GG  0  GG  1  GA  0  GG  0  GG 0 0 2 20
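
For anyone wondering why this works: gsub() returns the number of substitutions it made, so replacing a character with nothing doubles as a counter. The identifier in the first field contains a C of its own, which would inflate the C count unless that field is blanked first. A minimal sketch of both points (the sample strings and shell variable names here are made up for illustration, they are not from the question):

```shell
# gsub() returns how many replacements it made, so it can be used to count.
count=$(printf 'GATTACA\n' | awk '{ print gsub(/A/,"") }')
echo "$count"                 # 3

# The "C" in "NC_044998.1" is counted too unless $1 is emptied first.
line='NC_044998.1 3749 0 GG 1 GC'
with_id=$(printf '%s\n' "$line" | awk '{ print gsub(/C/,"") }')
without_id=$(printf '%s\n' "$line" | awk '{ $1=""; print gsub(/C/,"") }')
echo "$with_id $without_id"   # 2 1
```

Because gsub() edits the record as it counts, the original line has to be saved or printed before the counting starts.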

If you want to use an array of tags (characters) to avoid calling gsub() explicitly 4 times, you could do:

$ cat tst.awk
BEGIN {
    numTags = split("C T A G",tags)
}
{
    printf "%s", $0
    $1 = ""
    for (tagNr=1; tagNr<=numTags; tagNr++) {
        tag = tags[tagNr]
        printf "%s%d", OFS, gsub(tag,"")
    }
    print ""
}

$ awk -f tst.awk file
NC_044998.1  3749  0  GG  0  GG  0  GG  0  GG  1  GC  0  GG  0  GG  0  GG  0  GG  0  GG  0  GG 1 0 0 21
NC_044998.1  3755  1  TA  0  TT  0  TT  1  TA  1  TA  1  TA  1  TA  0  TT  1  TA  0  TT  1  TA 0 15 7 0
NC_044998.1  4012  0  TT  1  TA  1  TA  0  TT  0  TT  0  TT  0  TT  1  TA  0  TT  1  TA  0  TT 0 18 4 0
NC_044998.1  5298  1  GA  0  GG  0  GG  0  GG  0  GG  0  GG  0  GG  0  GG  1  GA  0  GG  0  GG 0 0 2 20

But IMHO that's overkill for this particular problem unless you expect to add other tags in the future.
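
If you would rather keep the field-by-field loop from the question, here is one way it could be repaired. This is only a sketch, assuming the genotype strings sit in every second field starting at field 4; the count_bases function name and the cnt array are my own additions for illustration. The question's version prints 0 0 0 0 because it adds letters to a number (each letter evaluates to 0) and then prints the tag letter itself with %d, which also gives 0.

```shell
count_bases() {
    awk '
    BEGIN { numTags = split("C T A G", tags) }
    {
        split("", cnt)                          # reset the per-line counts
        for (i = 4; i <= NF; i += 2)            # genotype fields: 4, 6, 8, ...
            for (j = 1; j <= length($i); j++)
                cnt[substr($i, j, 1)]++         # tally each character
        printf "%s", $0                         # the record is left unmodified
        for (tagNr = 1; tagNr <= numTags; tagNr++)
            printf "%s%d", OFS, cnt[tags[tagNr]]
        print ""
    }' "$@"
}

printf 'NC_044998.1  3755  1  TA  0  TT  1  TA\n' | count_bases
# NC_044998.1  3755  1  TA  0  TT  1  TA 0 4 2 0
```

Since this only looks at the genotype fields, there is no need to blank out the first field, at the cost of being longer than the gsub() version.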