Splitting columns and counting occurrences of specific characters per line
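I am trying to count the total number of occurrences of each of C, T, A and G per line, and append those counts to the end of each line.
My input looks like this: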
NC_044998.1 3749 0 GG 0 GG 0 GG 0 GG 1 GC 0 GG 0 GG 0 GG 0 GG 0 GG 0 GG
NC_044998.1 3755 1 TA 0 TT 0 TT 1 TA 1 TA 1 TA 1 TA 0 TT 1 TA 0 TT 1 TA
NC_044998.1 4012 0 TT 1 TA 1 TA 0 TT 0 TT 0 TT 0 TT 1 TA 0 TT 1 TA 0 TT
NC_044998.1 5298 1 GA 0 GG 0 GG 0 GG 0 GG 0 GG 0 GG 0 GG 1 GA 0 GG 0 GG
The desired output looks like this:
NC_044998.1 3749 0 GG 0 GG 0 GG 0 GG 1 GC 0 GG 0 GG 0 GG 0 GG 0 GG 0 GG 1 0 0 21
NC_044998.1 3755 1 TA 0 TT 0 TT 1 TA 1 TA 1 TA 1 TA 0 TT 1 TA 0 TT 1 TA 0 15 7 0
NC_044998.1 4012 0 TT 1 TA 1 TA 0 TT 0 TT 0 TT 0 TT 1 TA 0 TT 1 TA 0 TT 0 18 4 0
NC_044998.1 5298 1 GA 0 GG 0 GG 0 GG 0 GG 0 GG 0 GG 0 GG 1 GA 0 GG 0 GG 0 0 2 20
I tried adapting the following script, but it only prints 0 0 0 0 for every line:
BEGIN {
    numTags = split("C T A G",tags)
}
{
    s = 0
    for (i=4; i<=24; i+=2) {
        for (j=1; j<=2; ++j)
            tag = substr($i,j,1)
        s+=tag
    }
    printf "%s", $0
    for (tagNr=1; tagNr<=numTags; tagNr++) {
        s = tags[tagNr]
        printf "%s%d", OFS, s
    }
    print ""
}
$ awk '{ o=$0; $1=""; print o, gsub(/C/,""), gsub(/T/,""), gsub(/A/,""), gsub(/G/,"") }' file
NC_044998.1 3749 0 GG 0 GG 0 GG 0 GG 1 GC 0 GG 0 GG 0 GG 0 GG 0 GG 0 GG 1 0 0 21
NC_044998.1 3755 1 TA 0 TT 0 TT 1 TA 1 TA 1 TA 1 TA 0 TT 1 TA 0 TT 1 TA 0 15 7 0
NC_044998.1 4012 0 TT 1 TA 1 TA 0 TT 0 TT 0 TT 0 TT 1 TA 0 TT 1 TA 0 TT 0 18 4 0
NC_044998.1 5298 1 GA 0 GG 0 GG 0 GG 0 GG 0 GG 0 GG 0 GG 1 GA 0 GG 0 GG 0 0 2 20
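The one-liner above relies on two things: gsub() returns the number of replacements it made, so replacing a character with "" doubles as a counter, and the first field has to be cleared first because the ID NC_044998.1 itself contains a C. A minimal sketch of both effects on a shortened, made-up line (the variable names line, counts and with_id are only for illustration):

```shell
# A shortened sample line: an ID, a position, and two genotype pairs.
line='NC_044998.1 3749 0 GG 1 GC'

# Clear field 1 so the "C" in "NC_..." is not counted, then let each
# gsub() report how many characters it removed from the record.
counts=$(printf '%s\n' "$line" |
  awk '{ $1 = ""; print gsub(/C/,""), gsub(/T/,""), gsub(/A/,""), gsub(/G/,"") }')
echo "$counts"

# Without clearing field 1, the C of the ID is counted as well.
with_id=$(printf '%s\n' "$line" | awk '{ print gsub(/C/,"") }')
echo "$with_id"
```

The first echo prints 1 0 0 3 and the second prints 2, which is why the original record is saved in o before the fields are modified.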
If you want to use an array of tags (characters) to avoid explicitly calling gsub() 4 times, then you could do this:
$ cat tst.awk
BEGIN {
    numTags = split("C T A G",tags)
}
{
    printf "%s", $0
    $1 = ""
    for (tagNr=1; tagNr<=numTags; tagNr++) {
        tag = tags[tagNr]
        printf "%s%d", OFS, gsub(tag,"")
    }
    print ""
}
$ awk -f tst.awk file
NC_044998.1 3749 0 GG 0 GG 0 GG 0 GG 1 GC 0 GG 0 GG 0 GG 0 GG 0 GG 0 GG 1 0 0 21
NC_044998.1 3755 1 TA 0 TT 0 TT 1 TA 1 TA 1 TA 1 TA 0 TT 1 TA 0 TT 1 TA 0 15 7 0
NC_044998.1 4012 0 TT 1 TA 1 TA 0 TT 0 TT 0 TT 0 TT 1 TA 0 TT 1 TA 0 TT 0 18 4 0
NC_044998.1 5298 1 GA 0 GG 0 GG 0 GG 0 GG 0 GG 0 GG 0 GG 1 GA 0 GG 0 GG 0 0 2 20
But IMHO, unless you think other tags will be added in the future, that is overkill for this particular problem.
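For comparison, here is a sketch closer to the loop in the question, which walks the genotype fields (4, 6, 8, ...) character by character instead of using gsub(). The attempt in the question printed 0 0 0 0 because s = tags[tagNr] assigns the letter itself and %d formats a non-numeric string as 0; the version below keeps a per-character count in an array instead. The function name count_bases and the array name cnt are hypothetical, and looping up to NF rather than a fixed 24 is an assumption about the input:

```shell
# count_bases: hypothetical helper; reads the named files, or stdin if none.
count_bases() {
  awk '
    BEGIN { numTags = split("C T A G", tags) }
    {
      split("", cnt)                      # portable way to empty the array
      for (i = 4; i <= NF; i += 2)        # genotype fields only
        for (j = 1; j <= length($i); j++)
          cnt[substr($i, j, 1)]++         # tally each character seen
      printf "%s", $0
      for (t = 1; t <= numTags; t++)
        printf "%s%d", OFS, cnt[tags[t]]  # an unseen tag prints as 0
      print ""
    }
  ' "$@"
}

out=$(printf '%s\n' 'NC_044998.1 3755 1 TA 0 TT' | count_bases)
echo "$out"
```

For the shortened sample line this prints NC_044998.1 3755 1 TA 0 TT 0 3 1 0. Since it never looks at fields 1 to 3, there is no need to clear $1 here.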