Python/Biopython. Get enumerated list of sequences matching words after parsing file with protein sequences

Question

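In Python/Biopython, I am trying to get an enumerated list of the protein sequences that match the string "Human adenovirus". The problem with the code below is that the numbers I get are the positions of the sequences as they are parsed, not a numbering of the sequences that pass the if filter.

Edited code with correct syntax: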

from Bio import SeqIO
import sys  
sys.stdout = open("out_file.txt","w")

for index, seq_record in enumerate(SeqIO.parse("in_file.txt", "fasta")):
    if "Human adenovirus" in seq_record.description:
        print("%i] %s" % (index, seq_record.description))
        print(str(seq_record.seq) + "\n")

This is part of the input file:

>gi|927348286|gb|ALE15299.1| penton [Bottlenose dolphin adenovirus 1]
MQRPQQTPPPPYESVVEPLYVPSRYLAPSEGRNSIRYSQLPPLYD

>gi|15485528|emb|CAC67483.1| penton [Human adenovirus 2]
MQRAAMYEEGPPPSYESVVSAAPVAAALGSPFDAPLDPPFVPPRYLRPTGGRNSIRYSELAPLFDTTRVY
LVDNKSTDVASLNYQNDHSNFLTTVIQNNDY

>gi|1194445857|dbj|BAX56610.1| fiber, partial [Human mastadenovirus C]
FNPVYPYDTETGPPTVPFLTPPFVSPNG

The output file I get looks like this:

2] gi|15485528|emb|CAC67483.1| penton [Human adenovirus 2]
MQRAAMYEEGPPPSYESVVSAAPVAAALGSPFDAPLDPPFVPPRYLRPTGGRNSIRYSELAPLFDTTRVY
LVDNKSTDVASLNYQNDHSNFLTTVIQNNDY

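I would like the first sequence that passes the filter to be numbered 1], rather than 2] as shown above. I know I need to add a counter somehow after the if statement, but I have tried many alternatives without getting the desired output. This should not be difficult, and I know how to do it in Perl, but not in Python/Biopython.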

Answer 1

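The problem is that you only want the index to increase when the description contains "Human adenovirus", but you are enumerating every record in the file.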
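To see why the numbering follows the position in the file, here is a minimal sketch that uses only the three descriptions from the sample input. The helper name positions_of_matches is hypothetical, and plain strings stand in for the parsed records.

```python
# Descriptions copied from the sample input in the question.
descriptions = [
    "penton [Bottlenose dolphin adenovirus 1]",
    "penton [Human adenovirus 2]",
    "fiber, partial [Human mastadenovirus C]",
]


def positions_of_matches(items, needle):
    """Return the enumerate() index of every item that contains needle."""
    # enumerate() numbers every item, so a matching item keeps its
    # position in the full list, whether or not earlier items matched.
    return [i for i, text in enumerate(items) if needle in text]


print(positions_of_matches(descriptions, "Human adenovirus"))
```

In this three-record excerpt the only match sits at index 1, because enumerate() counts from 0 and numbers the non-matching first record as well. In the full input file the position will differ, but it is still a position in the file rather than a count of matches.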

If we modify your code example to increment the index only when a match is found, we get:

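from Bio import SeqIO

index = 0  # counts matching records only
with open("out_file.txt", "w") as f:
    for seq_record in SeqIO.parse("in_file.txt", "fasta"):
        if "Human adenovirus" in seq_record.description:
            index += 1
            # Write to the output file rather than printing to the console.
            f.write("%i] %s\n" % (index, seq_record.description))
            f.write(str(seq_record.seq) + "\n\n")

By the way, the original code opened the file for writing but never wrote to it, so here each matching record is written to f explicitly.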
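An alternative to the manual counter is to filter first and enumerate afterwards, using the start argument of enumerate(). The sketch below is only an illustration: Record is a hypothetical namedtuple standing in for Biopython's SeqRecord so that it runs without an input file, and numbered_matches is a made-up helper name. The sample records are the ones shown in the question.

```python
from collections import namedtuple

# Hypothetical stand-in for a parsed record; a Biopython SeqRecord
# also exposes .description and .seq.
Record = namedtuple("Record", ["description", "seq"])


def numbered_matches(records, needle):
    """Return an iterator of (n, record) pairs for matching records, n from 1."""
    # Filter first, then number: only records that pass the filter are counted.
    matching = (rec for rec in records if needle in rec.description)
    return enumerate(matching, start=1)


# Records copied from the sample input in the question.
records = [
    Record("gi|927348286|gb|ALE15299.1| penton [Bottlenose dolphin adenovirus 1]",
           "MQRPQQTPPPPYESVVEPLYVPSRYLAPSEGRNSIRYSQLPPLYD"),
    Record("gi|15485528|emb|CAC67483.1| penton [Human adenovirus 2]",
           "MQRAAMYEEGPPPSYESVVSAAPVAAALGSPFDAPLDPPFVPPRYLRPTGGRNSIRYSELAPLFDTTRVY"
           "LVDNKSTDVASLNYQNDHSNFLTTVIQNNDY"),
    Record("gi|1194445857|dbj|BAX56610.1| fiber, partial [Human mastadenovirus C]",
           "FNPVYPYDTETGPPTVPFLTPPFVSPNG"),
]

for n, rec in numbered_matches(records, "Human adenovirus"):
    print("%i] %s" % (n, rec.description))
    print(str(rec.seq) + "\n")
```

With real data, the same helper should accept the iterator returned by SeqIO.parse("in_file.txt", "fasta") in place of the records list, since it only reads the description and seq attributes.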
