根据列上的模式计算行数
Count number of rows based in a pattern on a column
我有一个看起来像这样的数据集
pdbid ch spacegroup ph uniprotacc name
5TUE A P 21 21 21 A0A059WYP6 Tetracycline destructase Tet(50)
5TUE B P 21 21 21 A0A059WYP6 Tetracycline destructase Tet(50)
5TUF A P 21 21 21 A0A059WYP6 Tetracycline destructase Tet(50)
5TUF B P 21 21 21 A0A059WYP6 Tetracycline destructase Tet(50)
5TUI A P 21 21 21 A0A059WYP6 Tetracycline destructase Tet(50)
5TUI B P 21 21 21 A0A059WYP6 Tetracycline destructase Tet(50)
6J3M A F 41 3 2 A0A059ZFC5 Phosphopantetheine adenylyltransferase
6JNH A F 41 3 2 A0A059ZFC5 Phosphopantetheine adenylyltransferase
6JOG A F 41 3 2 5.6 A0A059ZFC5 Phosphopantetheine adenylyltransferase
4BRZ A P 1 21 1 7 A0A067XG63 HALOALKANE DEHALOGENASE
4BRZ B P 1 21 1 7 A0A067XG63 HALOALKANE DEHALOGENASE
4C6H A P 21 21 2 A0A067XG66 HALOALKANE DEHALOGENASE
我想根据第 5 列 (uniprotacc) 计算每个条目。输出应该是这样的
pdbid ch spacegroup ph uniprotacc newval name
5TUE A P 21 21 21 A0A059WYP6 1 Tetracycline destructase Tet(50)
5TUE B P 21 21 21 A0A059WYP6 2 Tetracycline destructase Tet(50)
5TUF A P 21 21 21 A0A059WYP6 3 Tetracycline destructase Tet(50)
5TUF B P 21 21 21 A0A059WYP6 4 Tetracycline destructase Tet(50)
5TUI A P 21 21 21 A0A059WYP6 5 Tetracycline destructase Tet(50)
5TUI B P 21 21 21 A0A059WYP6 6 Tetracycline destructase Tet(50)
6J3M A F 41 3 2 A0A059ZFC5 1 Phosphopantetheine adenylyltransferase
6JNH A F 41 3 2 A0A059ZFC5 2 Phosphopantetheine adenylyltransferase
6JOG A F 41 3 2 5.6 A0A059ZFC5 3 Phosphopantetheine adenylyltransferase
4BRZ A P 1 21 1 7 A0A067XG63 1 HALOALKANE DEHALOGENASE
4BRZ B P 1 21 1 7 A0A067XG63 2 HALOALKANE DEHALOGENASE
4C6H A P 21 21 2 A0A067XG66 1 HALOALKANE DEHALOGENASE
我不知道,我想也许 awk 甚至 libreoffice-calc 可以轻松完成这项工作。但任何帮助表示赞赏。
文件是一个 tsv
意思是制表符分隔的文件。
在 Calc 中,将此公式放在第 F
列的第二行(newval 列)并向下拖动以填充。
=IF(E2=E1;F1+1;1)
这是一个 awk
脚本解决方案。
script.awk
BEGIN {FS = OFS = "\t"}
NR==1 {$NF = "newval" OFS $NF}
NR>1 {$NF = ++seen[$(NF - 1)] OFS $NF}
1
运行:
awk -f script.awk input.tsv
输出:
pdbid ch spacegroup ph uniprotacc newval name
5TUE A P 21 21 21 A0A059WYP6 1 Tetracycline destructase Tet(50)
5TUE B P 21 21 21 A0A059WYP6 2 Tetracycline destructase Tet(50)
5TUF A P 21 21 21 A0A059WYP6 3 Tetracycline destructase Tet(50)
5TUF B P 21 21 21 A0A059WYP6 4 Tetracycline destructase Tet(50)
5TUI A P 21 21 21 A0A059WYP6 5 Tetracycline destructase Tet(50)
5TUI B P 21 21 21 A0A059WYP6 6 Tetracycline destructase Tet(50)
6J3M A F 41 3 2 A0A059ZFC5 1 Phosphopantetheine adenylyltransferase
6JNH A F 41 3 2 A0A059ZFC5 2 Phosphopantetheine adenylyltransferase
6JOG A F 41 3 2 5.6 A0A059ZFC5 3 Phosphopantetheine adenylyltransferase
4BRZ A P 1 21 1 7 A0A067XG63 1 HALOALKANE DEHALOGENASE
4BRZ B P 1 21 1 7 A0A067XG63 2 HALOALKANE DEHALOGENASE
4C6H A P 21 21 2 A0A067XG66 1 HALOALKANE DEHALOGENASE
脚本解释:
BEGIN { # pre processing
FS = "\t"; # assign input field separator to "\t" tab
OFS = "\t"; # assign ouput field separator to "\t" tab
}
NR==1 { # processing first line
# $NF is the last field in input line
$NF = "newval" OFS $NF; # prefix last field with "newval" and tab
}
NR>1 { # processing non first line
# $(NF - 1) is the befroe last field in input line. Such as A0A059WYP6
# seen[$(NF - 1)] is an array couning the occurance of $(NF - 1)
# ++seen[$(NF - 1)] is an incremented array couning the occurance of $(NF - 1)
$NF = ++seen[$(NF - 1)] OFS $NF; # prefix last field with an incremented array couning the occurance of $(NF - 1) and tab
}
{print} # print every processed line
我有一个看起来像这样的数据集
pdbid ch spacegroup ph uniprotacc name
5TUE A P 21 21 21 A0A059WYP6 Tetracycline destructase Tet(50)
5TUE B P 21 21 21 A0A059WYP6 Tetracycline destructase Tet(50)
5TUF A P 21 21 21 A0A059WYP6 Tetracycline destructase Tet(50)
5TUF B P 21 21 21 A0A059WYP6 Tetracycline destructase Tet(50)
5TUI A P 21 21 21 A0A059WYP6 Tetracycline destructase Tet(50)
5TUI B P 21 21 21 A0A059WYP6 Tetracycline destructase Tet(50)
6J3M A F 41 3 2 A0A059ZFC5 Phosphopantetheine adenylyltransferase
6JNH A F 41 3 2 A0A059ZFC5 Phosphopantetheine adenylyltransferase
6JOG A F 41 3 2 5.6 A0A059ZFC5 Phosphopantetheine adenylyltransferase
4BRZ A P 1 21 1 7 A0A067XG63 HALOALKANE DEHALOGENASE
4BRZ B P 1 21 1 7 A0A067XG63 HALOALKANE DEHALOGENASE
4C6H A P 21 21 2 A0A067XG66 HALOALKANE DEHALOGENASE
我想根据第 5 列 (uniprotacc) 计算每个条目。输出应该是这样的
pdbid ch spacegroup ph uniprotacc newval name
5TUE A P 21 21 21 A0A059WYP6 1 Tetracycline destructase Tet(50)
5TUE B P 21 21 21 A0A059WYP6 2 Tetracycline destructase Tet(50)
5TUF A P 21 21 21 A0A059WYP6 3 Tetracycline destructase Tet(50)
5TUF B P 21 21 21 A0A059WYP6 4 Tetracycline destructase Tet(50)
5TUI A P 21 21 21 A0A059WYP6 5 Tetracycline destructase Tet(50)
5TUI B P 21 21 21 A0A059WYP6 6 Tetracycline destructase Tet(50)
6J3M A F 41 3 2 A0A059ZFC5 1 Phosphopantetheine adenylyltransferase
6JNH A F 41 3 2 A0A059ZFC5 2 Phosphopantetheine adenylyltransferase
6JOG A F 41 3 2 5.6 A0A059ZFC5 3 Phosphopantetheine adenylyltransferase
4BRZ A P 1 21 1 7 A0A067XG63 1 HALOALKANE DEHALOGENASE
4BRZ B P 1 21 1 7 A0A067XG63 2 HALOALKANE DEHALOGENASE
4C6H A P 21 21 2 A0A067XG66 1 HALOALKANE DEHALOGENASE
我不知道,我想也许 awk 甚至 libreoffice-calc 可以轻松完成这项工作。但任何帮助表示赞赏。
文件是一个 tsv
意思是制表符分隔的文件。
在 Calc 中,将此公式放在第 F
列的第二行(newval 列)并向下拖动以填充。
=IF(E2=E1;F1+1;1)
这是一个 awk
脚本解决方案。
script.awk
BEGIN {FS = OFS = "\t"}
NR==1 {$NF = "newval" OFS $NF}
NR>1 {$NF = ++seen[$(NF - 1)] OFS $NF}
1
运行:
awk -f script.awk input.tsv
输出:
pdbid ch spacegroup ph uniprotacc newval name
5TUE A P 21 21 21 A0A059WYP6 1 Tetracycline destructase Tet(50)
5TUE B P 21 21 21 A0A059WYP6 2 Tetracycline destructase Tet(50)
5TUF A P 21 21 21 A0A059WYP6 3 Tetracycline destructase Tet(50)
5TUF B P 21 21 21 A0A059WYP6 4 Tetracycline destructase Tet(50)
5TUI A P 21 21 21 A0A059WYP6 5 Tetracycline destructase Tet(50)
5TUI B P 21 21 21 A0A059WYP6 6 Tetracycline destructase Tet(50)
6J3M A F 41 3 2 A0A059ZFC5 1 Phosphopantetheine adenylyltransferase
6JNH A F 41 3 2 A0A059ZFC5 2 Phosphopantetheine adenylyltransferase
6JOG A F 41 3 2 5.6 A0A059ZFC5 3 Phosphopantetheine adenylyltransferase
4BRZ A P 1 21 1 7 A0A067XG63 1 HALOALKANE DEHALOGENASE
4BRZ B P 1 21 1 7 A0A067XG63 2 HALOALKANE DEHALOGENASE
4C6H A P 21 21 2 A0A067XG66 1 HALOALKANE DEHALOGENASE
脚本解释:
BEGIN { # pre processing
FS = "\t"; # assign input field separator to "\t" tab
OFS = "\t"; # assign ouput field separator to "\t" tab
}
NR==1 { # processing first line
# $NF is the last field in input line
$NF = "newval" OFS $NF; # prefix last field with "newval" and tab
}
NR>1 { # processing non first line
# $(NF - 1) is the befroe last field in input line. Such as A0A059WYP6
# seen[$(NF - 1)] is an array couning the occurance of $(NF - 1)
# ++seen[$(NF - 1)] is an incremented array couning the occurance of $(NF - 1)
$NF = ++seen[$(NF - 1)] OFS $NF; # prefix last field with an incremented array couning the occurance of $(NF - 1) and tab
}
{print} # print every processed line