创建一个新数组

Question

I am writing a program that parses sequence alleles. I have written code that reads a file and creates a header array and a sequence array. Here is an example of a file:

>DQB1*04:02:01
------------------------------------------------------------
--ATGTCTTGGAAGAAGGCTTTGCGGAT-------CCCTGGAGGCCTTCGGGTAGCAACT
GTGACCTT----GATGCTGGCGATGCTGAGCACCCCGGTGGCTGAGGGCAGAGACTCTCC
CGAGGATTTCGTGTTCCAGTTTAAGGGCATGTGCTACTTCACCAACGGGACCGAGCGCGT
GTTGGAGCTCCGCACGACCTTGCAGCGGCGA-----------------------------
---GTGGAGCCCACAGTGACCATCTCCCCATCCAGGACAGAGGCCCTCAACCACCACAAC
CTGCTGGTCTGCTCAGTGACAG----CATTGGAGGCTTCGTGCTGGGGCTGATCTTCCTC
GGGCTGGGCCTTATTATC--------------CATCACAGGAGTCAGAAAGGGCTCCTGC
ACTGA-------------------------------------------------------
>OMIXON_CONSENSUS_M_155_09_4890_DQB1*04:02:01
-------------------ATCAGGTCCAAGCTGTGTTGACTACCACTACTTTTCCCTTC
GTCTCAATTATGTCTTGGAAGAAGGCTTTGCGGATCCCTGGAGGCCTTCGGGTAGCAACT
GTGACCTTGATGCTGGCGATGCTGAGCACCCCGGTGGCTGAGGGCAGAGACTCTCCCGGT
AAGTGCAGGGCCACTGCTCTCCAGAGCCGCCACTCTGGGAACAGGCTCTCCTTGGGCTGG
GGTAGGGGGATGGTGATCTCCATGATCTCGGACACAATCTTTCATCAACATTTCCTCTCT
TTGGGGAAAGAGAACGATGTTGCATTCCCATTTATCTTT---------------------
>GENDX_CONSENSUS_M_155_09_4890_DQB1*04:02:01
TGCCAGGTACATCAGATCCATCAGGTCCAAGCTGTGTTGACTACCACTACTTTTCCCTTC
GTCTCAATTATGTCTTGGAAGAAGGCTTTGCGGATCCCTGGAGGCCTTCGGGTAGCAACT
GTGACCTTGATGCTGGCGATGCTGAGCACCCCGGTGGCTGAGGGCAGAGACTCTCCCGGT
AAGTGCAGGGCCACTGCTCTCCAGAGCCGCCACTCTGGGAACAGGCTCTCCTTGGGCTGG
GGTAGGGGGATGGTGATCTCCATGATCTCGGACACAATCTTTCATCAACATTTCCTCTCT

The headers are ('>DQB1', '>GENDX', and '>OMIXON') and the three sequences are the other three strings as seen above.

The next part of my code detects if the allele sequence is complete or incomplete. An allele is determined as "incomplete" if there are more than 4 breaks within the >DQB1 sequence. (A break is signified by a '-'). For example, the above sequence is broken because there are five breaks.

I am trying to write code where if there is an incomplete allele detected, the program creates a new array with only the >GENDX and the >OMIXON headers and sequences.

How can I make an array that does not include >DQB1?

Here is my code as is:

import sys, re

max_num_breaks=4
filename=sys.argv[1]
f=open(filename,"r")
header=[]
header2=[]
sequence=[]
sequence2=[]
string=""
for line in f:
    if ">" in line and string=="":
        header.append(line[:-1])
    elif ">" in line and string!="":
        sequence.append(string)
        header.append(line[:-1])
        string=""
    else:
        string=string+line[:-1]
sequence.append(string)
s1=sequence[0]
breaks=sum(1 for m in re.finditer("-+",''.join(s1.splitlines())))
if breaks>max_num_breaks:
    print "Incomplete Reference Allele Detected"
    for m in range(len(header)):
        if re.finditer(header[m], 'OMIXON') or re.finditer(header[m], 'GENDX'):
            header2.append(header[m])
            sequence2.append(sequence[m])
    print header2

The problem with the above code is that whenever I print header2 it still includes the DQB1.

Answer 1

为什么要使用 re.finditer？

怎么样

if header[m].find('OMIXON') > -1 or header[m].find('GENDX') > -1:

Answer 2

假设您的序列保存在 FASTA 文件中，您可以使用 Biopython 轻松完成此操作。

from Bio import SeqIO

headers = [record.id for record in SeqIO.parse("myfile.fasta", "fasta")][1:]

大功告成。

如果要从parse()对象中获取序列部分，只需使用record.seq。

Answer 3

re.finditer 函数并不像您想象的那样。有关示例，请参阅 here。

我建议改用这个：

if header[m][1:7] == 'OMIXON' or header[m][1:6]=='GENDX':

创建一个新数组

Make a new array

python

arrays

header

sequence

python-2.7