Concatenate fasta files in a loop
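I have multiple fasta files, each containing several individuals whose sequences all have the same length. What I want to do is build, for each species, a concatenation of its sequences across the fasta files.
In the loop: if a species is found in the next file, I concatenate its sequence; if not, I concatenate gaps ('-') of the same length as the other sequences in that file (the files are aligned).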
import glob
from Bio import SeqIO

species_list = []
files = [file for file in glob.glob('~/*.fa')]
for aln in files:
    with open(aln, 'rU') as multispecies:
        sequences = SeqIO.parse(multispecies, 'fasta')
        for species in sequences:
            species_list.append(species.id)
species_list = list(set(species_list))
#print(species_list)
concat = {}
for aln in files:
    #print(aln)
    dict = {}
    with open(aln, 'rU') as multispecies:
        sequences = SeqIO.parse(multispecies, 'fasta')
        names = []
        for fasta in sequences:
            names.append(fasta.id)
            dict[fasta.id] = fasta.seq
    count_species = 0
    for i in species_list:
        if i in names:
            count_species = count_species + 1
            print('>' + i + '\n' + dict[i])
            gap = int(len(dict[i]))
            concat[i] += dict[i] #I cannot find a way to concatenate here
        else:
            print('>' + i + '\n' + '-'*gap)
            concat[i] += '-'*gap #I cannot find a way to concatenate here
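Your concat should either be a defaultdict, or you should handle missing keys yourself by creating some kind of iterable for them: a str in this case, or preferably a list. Then you can extend that iterable with the new values: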
# list-based
concat.setdefault(i, []).extend(dict[i]) # should work if you keep the data in a list
# string-based
concat[i] = concat.get(i, '') + dict[i]
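The string-based approach can be quite inefficient, though: strings are immutable, so each concatenation builds a new string from scratch. If you need a string in the end, you can always collect the pieces in a list and ''.join() them once you are done building it.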
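As a rough illustration of the list-then-join idea, here is a minimal sketch, not a drop-in replacement for the code in the question. It assumes each file has already been read into a plain dict mapping species id to sequence string (with Biopython, something like `{rec.id: str(rec.seq) for rec in SeqIO.parse(path, 'fasta')}`). The function name `concatenate_alignments` and the sample data are hypothetical. Note that it takes the gap length from the current file rather than from the previous species in the loop, which is a change from the question's logic.

```python
from collections import defaultdict


def concatenate_alignments(alignments):
    """Concatenate per-file alignments, padding absent species with gaps.

    `alignments` is a list of dicts (one per FASTA file) mapping
    species id -> aligned sequence. All sequences within one dict are
    assumed to have the same length.
    """
    # Collect every species seen in any file, in a stable order.
    species = sorted({name for aln in alignments for name in aln})

    pieces = defaultdict(list)  # missing keys start out as empty lists
    for aln in alignments:
        # Take the alignment length from any sequence in this file.
        length = len(next(iter(aln.values()))) if aln else 0
        for name in species:
            # Real sequence when present, otherwise a run of gaps.
            pieces[name].append(str(aln.get(name, '-' * length)))

    # Join once at the end instead of growing a string on every iteration.
    return {name: ''.join(parts) for name, parts in pieces.items()}


# Hypothetical data standing in for two parsed FASTA files.
file1 = {'human': 'ACGT', 'mouse': 'AC-T'}
file2 = {'human': 'GG', 'chimp': 'GA'}
result = concatenate_alignments([file1, file2])
for name, seq in result.items():
    print('>' + name + '\n' + seq)
```

Every species ends up with a sequence of the same total length, because a file that lacks a species contributes gaps equal to that file's alignment length.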