I have a protein sequence file and I want to count the trimers in it using sed or grep
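I have a protein sequence file in the following format:
uniprotID\space\sequence
The sequence is a string of any length, but it may only contain the 20 allowed letters, i.e.
ARNDCQEGHILKMFPSTWYV
Example of 1 record:
Q5768D AKCCACAKCCAC
I want to create a csv file in the following format: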
Q5768D
12
ACA 1
AKC 2
CAC 2
CAK 1
CCA 2
KCC 2
This is what I am currently trying:
#!/bin/sh
while read ID SEQ # uniprot along with sequences
do
echo $SEQ | tr -d '[[:space:]]' | sed 's/./& /g' > TEST_FILE
declare -a SSA=(`cat TEST_FILE`)
SQL=$(echo ${#SSA[@]})
for (( X=0; X <= "$SQL"; X++ ))
do
Y=$(expr $X + 1)
Z=$(expr $X + 2)
echo ${SSA[X]} ${SSA[Y]} ${SSA[Z]}
done | awk '{if (NF == 3) print}' | tr -d ' ' > TEMPTRIMER
rm TEST_FILE # removing temporary sequence file
sort TEMPTRIMER|uniq -c > $ID.$SQL
done <
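In this code I store each record in a separate file, which is not good. The program is also very slow: in 12 hours it got through only 12,000 of the 500,000 records.
If this is what you want: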
$ cat file
Q5768D AKCCACAKCCAC
OTHER FOOBARFOOBAR
$
$ awk -f tst.awk file
Q5768D OTHER
12 12
AKC 2 FOO 2
KCC 2 OOB 2
CCA 2 OBA 2
CAC 2 BAR 2
ACA 1 ARF 1
CAK 1 RFO 1
then this will do it:
$ cat tst.awk
BEGIN { OFS="\t" }
{
    colNr = NR
    rowNr = 0
    name[colNr] = $1
    lgth[colNr] = length($2)
    delete name2nr
    for (i=1;i<=(length($2)-2);i++) {
        trimer = substr($2,i,3)
        if ( !(trimer in name2nr) ) {
            name2nr[trimer] = ++rowNr
            nr2name[colNr,rowNr] = trimer
        }
        cnt[colNr,name2nr[trimer]]++
    }
    numCols = colNr
    numRows = (rowNr > numRows ? rowNr : numRows)
}
END {
    for (colNr=1;colNr<=numCols;colNr++) {
        printf "%s%s", name[colNr], (colNr<numCols?OFS:ORS)
    }
    for (colNr=1;colNr<=numCols;colNr++) {
        printf "%s%s", lgth[colNr], (colNr<numCols?OFS:ORS)
    }
    for (rowNr=1;rowNr<=numRows;rowNr++) {
        for (colNr=1;colNr<=numCols;colNr++) {
            printf "%s %s%s", nr2name[colNr,rowNr], cnt[colNr,rowNr], (colNr<numCols?OFS:ORS)
        }
    }
}
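If you would rather have one line of output per record, as shown below, that is simpler and more efficient than the above and uses less memory: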
$ cat tst2.awk
{
    delete cnt
    for (i=1;i<=(length($2)-2);i++) {
        cnt[substr($2,i,3)]++
    }
    printf "%s;%s", $1, length($2)
    for (trimer in cnt) {
        printf ";%s=%s", trimer, cnt[trimer]
    }
    print ""
}
$ awk -f tst2.awk file
Q5768D;12;ACA=1;KCC=2;CAK=1;CAC=2;CCA=2;AKC=2
OTHER;12;RFO=1;FOO=2;OBA=2;OOB=2;ARF=1;BAR=2
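Note that awk does not define the order in which `for (trimer in cnt)` visits the keys, which is why the trimers above come out unsorted. If you need them sorted by name, as in the question, one option is to print one line per trimer and let `sort` do the ordering. The following is only a minimal sketch under that assumption: the function name `trimers_long` and the tab-separated "long" layout are made up for illustration and are not part of tst2.awk.

```shell
# Hypothetical helper: prints one "ID<TAB>length<TAB>trimer<TAB>count" line
# per distinct trimer. Records stay in input order; trimers are sorted by name.
# Reads the files given as arguments, or stdin when there are none.
trimers_long() {
    awk -v OFS='\t' '
    {
        split("", cnt)                      # portable way to empty an array
        for (i = 1; i <= length($2) - 2; i++)
            cnt[substr($2, i, 3)]++
        for (t in cnt)                      # prefix NR so sort can keep input order
            print NR, $1, length($2), t, cnt[t]
    }' "$@" |
    LC_ALL=C sort -t "$(printf '\t')" -k1,1n -k4,4 |
    cut -f2-                                # drop the helper NR column again
}
```

Calling `trimers_long file` on the sample file lists ACA, AKC, CAC, CAK, CCA, KCC for Q5768D in that order. A record whose sequence is shorter than 3 letters produces no lines at all in this layout, and sorting the whole stream costs more than tst2.awk alone, so this is worth it only if the order matters.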
This perl script processes about 550,000 trimers per second. (Tested on random valid sequences 0-8000 characters long; 100k records (~400 MB) produce a 2 GB output csv.)
Output:
Q1024A;421;AAF=1;AAK=1;AFC=1;AFE=2;AGP=1;AHC=1;AHE=1;AIV=1;AKN=1;AMC=1;AQD=1;AQY=1;...
Q1074F;6753;AAA=1;AAD=1;AAE=1;AAF=2;AAN=2;AAP=2;AAT=1;ACA=1;ACC=1;ACD=1;ACE=3;ACF=2;...
Code:
#!/usr/bin/perl
use strict;
$|=1;
my $c;
# process each line on input
while (readline STDIN) {
    $c++; chomp;
    # is it a valid line? has the format and a sequence to process
    if (m~^(\w+)\s+([ARNDCQEGHILKMFPSTWYV]+)\r?$~ and $2) {
        print join ";",($1,length($2));
        my %trimdb;
        my $seq=$2;
        # split the sequence into chars
        my @a=split //,$seq;
        my @trimmer;
        # while there are unprocessed chars in the sequence...
        while (scalar @a) {
            # fill up the buffer with a char from the top of the sequence
            push @trimmer, shift @a;
            # if the buffer is full (has 3 chars), increase the trimer frequency
            if (scalar @trimmer == 3 ) {
                $trimdb{(join "",@trimmer)}++;
                # drop the first letter from buffer, for next loop
                shift @trimmer;
            }
        }
        # we're done with the sequence - print the sorted list of trimers
        foreach (sort keys %trimdb) {
            # print in a csv (;) line
            print ";$_=$trimdb{$_}";
        }
        print "\n";
    }
    else {
        # the input line was not valid.
        print STDERR "input error: $_\n";
    }
    # just a progress counter
    printf STDERR "%8i\r",$c if not $c%100;
}
print STDERR "\n";
If you have perl installed (most Linux distributions do; check that /usr/bin/perl exists or replace it with your path), just run: ./count_trimers.pl < your_input_file.txt > output.csv
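With an output this large it can be worth checking the result before using it. A sequence of length n contains n-2 overlapping trimers (none if n is below 3), so the counts on each `ID;length;TRI=count;...` line should add up to exactly that. The following is a minimal sketch of such a check; the function name `check_trimer_csv` is made up for illustration, and it assumes the semicolon-separated layout shown above.

```shell
# Hypothetical sanity check for "ID;length;TRI=count;..." lines.
# Reports every line whose counts do not sum to length-2 and returns
# a non-zero exit status if at least one such line was found.
check_trimer_csv() {
    awk -F';' '
    {
        sum = 0
        for (i = 3; i <= NF; i++) {         # fields 3..NF are TRI=count pairs
            split($i, kv, "=")
            sum += kv[2]
        }
        expected = ($2 >= 3 ? $2 - 2 : 0)   # n-2 overlapping windows of width 3
        if (sum != expected) {
            printf "mismatch on line %d (%s): got %d, expected %d\n", NR, $1, sum, expected
            bad = 1
        }
    }
    END { exit bad + 0 }' "$@"
}
```

For example, `check_trimer_csv output.csv && echo OK` prints OK only when every line is consistent. The check says nothing about whether individual trimers are counted correctly, only that no window was lost or counted twice.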