Save a string of text if it's preceded by a different specific string of text?

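Sorry for the poor title; I wasn't sure how to word my question.

I wrote a script that extracts data from a fastq file (a plain-text file of genomic reads). In each record, line 1 is the header and line 2 is the string of bases; lines 3 and 4 are not needed.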

filename = 'C0_GGCTAC_R1_no_adapter_trimming.fastq'
new_filename = filename[:-9] + '_new.fastq'

with open(filename) as f_obj:
    file_contents = f_obj.readlines()

extracted_lines = ''
line_count = 0

# Pull header and base lines
for line in file_contents:
    line_count += 1
    # Headers
    if line_count == 1:
        extracted_lines += line
    # Reads ending in A
    elif line_count == 2 and line[-2] == 'A':
        extracted_lines += line
    # Reads ending in G
    elif line_count == 2 and line[-2] == 'G':
        extracted_lines += line
    # Reset counter
    elif line_count == 4:
        line_count = 0

with open(new_filename, 'w') as f_obj:
    f_obj.write(extracted_lines)
print(new_filename + " was created.")

The script pulls the header of every read, plus the read's string of bases whenever that string ends in A or G. An example of the input file:

@HWI-D00461:137:C9H2FACXX:3:1101:1239:1968 1:N:0:GGCTAC
NTGTGTAATAGATTTTACTTTTGCCTTTAAGCCCAAGGTCCTGGACTTGAAACATCCAAGGGATGGAAAATGCCGTATAACAGGGTGGAAGAGAGATTTGA
+
#1=BDDFFHHHFHIJJJJJJJJJJJJJJJJJJJJJIJJIJJJJJHJIIJHGIJJJJJJIHJJBGHJHIIJJJHHHHFFFFEEEDD;?BACDDDA?@CDDDC
@HWI-D00461:137:C9H2FACXX:3:1101:1117:1968 1:N:0:GGCTAC
NAAAGTCTACCAATTATACTTAGTGTGAAGAGGTGGGAGTTAAATATGACTTCCATTAATAGTTTCATTGTTTGGAAAACAGAGGTAATTTTTGATACAGA
+
#1=DDDFDFHHHGHIIGJJJJHIJIHHDIHHIJGGEI@GFGHIHIJHEFHIIIIGIJGHHGECFGIDHGIHIIEGIIJHHEEFFF7?ACEECCBBDEDDDC

The output file looks like this:

@HWI-D00461:137:C9H2FACXX:3:1101:1117:1968 1:N:0:GGCTAC
NAAAGTCTACCAATTATACTTAGTGTGAAGAGGTGGGAGTTAAATATGACTTCCATTAATAGTTTCATTGTTTGGAAAACAGAGGTAATTTTTGATACAGA
@HWI-D00461:137:C9H2FACXX:3:1101:1200:1972 1:N:0:GGCTAC
@HWI-D00461:137:C9H2FACXX:3:1101:1087:1973 1:N:0:GGCTAC
NTAATCCAACTAACTAAAAATAAAAAGATTCAAATAGGTACAGAAAACAATGAAGGTGTAGAGGTGAGAAATCAACAGGATGTTCAGAAGCCTGTGTATGA

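Although this contains all the data I need, it also pulls out every header line (the ones starting with "@"), which is unnecessary.

How can I modify my code so that it extracts a header line only if it is followed by a string of bases ending in A or G?

The problem is that you add the id for every record, not just for the records you are interested in. A quick solution is to keep the id in a variable and add it only when needed: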

filename = 'C0_GGCTAC_R1_no_adapter_trimming.fastq'
new_filename = filename[:-9] + '_new.fastq'

with open(filename) as f_obj:
    file_contents = f_obj.readlines()

extracted_lines = ''
line_count = 0

# Pull header and base lines
for line in file_contents:
    line_count += 1
    # Headers
    if line_count == 1:
        id_string = line
    # Reads ending in A
    elif line_count == 2 and line[-2] == 'A':
        extracted_lines += id_string
        extracted_lines += line
    # Reads ending in G
    elif line_count == 2 and line[-2] == 'G':
        extracted_lines += id_string
        extracted_lines += line
    # Reset counter
    elif line_count == 4:
        line_count = 0

with open(new_filename, 'w') as f_obj:
    f_obj.write(extracted_lines)
print(new_filename + " was created.")

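I should also say that this code is not efficient, particularly in memory use: you read a (usually) very large file into memory, yet you only ever need one record at a time.

A minor point is that your conditions can be condensed, and you can use the modulo operator to tell which type of line you are on: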

filename = 'C0_GGCTAC_R1_no_adapter_trimming.fastq'
new_filename = filename[:-9] + '_new.fastq'

with open(filename) as in_f_obj, open(new_filename, 'w') as out_f_obj:
    # Process the file
    line_count = 0
    for line in in_f_obj:
        line_count += 1

        # Extract the information for each record
        if line_count % 4 == 1:
            id_string = line
        elif line_count % 4 == 2:
            seq = line
        elif line_count % 4 == 3:
            extra = line
        elif line_count % 4 == 0:
            # Last part of the record. Here we have all the information
            # and we can decide if we want to output something
            # and what we want to output
            qual = line
            if seq[-2] == 'A' or seq[-2] == 'G':
                out_f_obj.write(id_string)
                out_f_obj.write(seq)

print(new_filename + " was created.")

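In this code you keep only one record in memory. The line_count variable holds the actual number of lines processed, and you have all the data from the input record available, so you can easily change what gets written.

One extra detail I would add: strip the newline from each line as you read it, and add it back as needed when writing: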
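As an illustration of how the output can be changed, here is a minimal sketch that writes all four lines of each matching record, so the result is still a complete fastq file. The function name filter_records and the in-memory sample data are made up for the example; with real files you would pass the two open file objects instead.

```python
import io


def filter_records(in_f_obj, out_f_obj, endings=("A", "G")):
    """Copy complete 4-line records whose sequence ends in one of `endings`.

    Returns the number of records written. Assumes every record is
    exactly 4 lines long (header, sequence, '+', quality).
    """
    endings = tuple(endings)  # str.endswith needs a str or a tuple
    kept = 0
    record = []
    for line in in_f_obj:
        record.append(line)
        if len(record) == 4:
            # record[1] is the sequence line; strip the newline before testing
            if record[1].rstrip().endswith(endings):
                out_f_obj.writelines(record)
                kept += 1
            record = []
    return kept


# Hypothetical two-record input: only the second read ends in A
sample = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGA\n+\nIIII\n")
out = io.StringIO()
print(filter_records(sample, out))  # 1
print(out.getvalue(), end="")
```

Only one record is held in memory at a time, as in the loop above.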

# Extract the information for each record
if line_count % 4 == 1:
    id_string = line.rstrip()
elif line_count % 4 == 2:
    seq = line.rstrip()
elif line_count % 4 == 3:
    extra = line.rstrip()
elif line_count % 4 == 0:
    # Last part of the record. Here we have all the information
    # and we can decide if we want to output something
    # and what we want to output
    qual = line.rstrip()
    if seq[-1] == 'A' or seq[-1] == 'G':
        out_f_obj.write("{}\n{}\n".format(id_string, seq))

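That way your data is clean and does not carry the newline formatting of the input file.

I think going through the file in chunks of 4 lines rather than line by line would make your task easier, at least assuming that there really are always 4 lines belonging together. You can then filter on the bases you want before adding the corresponding header line, e.g.: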
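A practical reason for stripping: indexing with [-2] assumes every line ends in exactly one newline character, which may not hold for the last line of a file. A small sketch of the difference, using made-up sequences and hypothetical helper names:

```python
def ends_in_a_or_g_raw(line):
    # Assumes the line ends with exactly one newline character
    return line[-2] in ("A", "G")


def ends_in_a_or_g_stripped(line):
    # Strip trailing whitespace first; [-1:] is safe on an empty string
    return line.rstrip()[-1:] in ("A", "G")


with_newline = "ACTA\n"
without_newline = "ACTA"  # e.g. the final line of a file with no trailing newline

print(ends_in_a_or_g_raw(with_newline), ends_in_a_or_g_stripped(with_newline))
print(ends_in_a_or_g_raw(without_newline), ends_in_a_or_g_stripped(without_newline))
```

For the line without a newline, the raw check looks at the second-to-last base ('T') and wrongly rejects the read, while the stripped check still sees the final 'A'.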

extracted_lines = []
for i in range(0, len(file_contents), 4):
    header, bases, comment1, comment2 = file_contents[i:i+4]
    # readlines() keeps the trailing newline, so strip it before testing
    if bases.rstrip().endswith(("A", "G")):
        extracted_lines.append(header)
        extracted_lines.append(bases)
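The same 4-lines-at-a-time idea can also be applied without reading the whole file into a list first. The following is only a sketch under the same assumption that records are always exactly 4 lines; the function name extract_headers_and_bases and the sample data are invented for the example. Passing the same iterator to zip four times makes each step consume four consecutive lines.

```python
import io


def extract_headers_and_bases(lines, endings=("A", "G")):
    """Yield the header and bases lines of records whose bases end in `endings`.

    `lines` can be any iterable of lines, including an open file object.
    A trailing incomplete record (fewer than 4 lines) is silently dropped.
    """
    endings = tuple(endings)
    it = iter(lines)
    # zip pulls one line from `it` for each of its four arguments in turn,
    # so every tuple is one 4-line record
    for header, bases, _comment1, _comment2 in zip(it, it, it, it):
        if bases.rstrip().endswith(endings):
            yield header
            yield bases


# Hypothetical input: the first read ends in T, the second in G
sample = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nTTAG\n+\nIIII\n")
extracted_lines = list(extract_headers_and_bases(sample))
print(extracted_lines)  # ['@r2\n', 'TTAG\n']
```

With a real file you could pass the open file object as lines and write each yielded line straight to the output file.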