Iterate over huge text file: read chunks between two recurring patterns using Python
I need to parse a huge (20 GB, far too large for memory) text file from the biological sequence database GenBank and extract the same information from every database entry. Each entry starts with a line of the form LOCUS XYZ some more text and ends with a line containing only //. For example:
LOCUS 123 some more text
many lines of some more text
many lines of some more text
many lines of some more text
//
LOCUS 231 some more text
many lines of some more text
many lines of some more text
many lines of some more text
//
LOCUS 312 some more text
many lines of some more text
many lines of some more text
many lines of some more text
//
Now, is there a way to tell Python to iteratively read the corresponding three chunks of that file into some variable var? More precisely:
Iteration 1: var =
LOCUS 123 some more text
many lines of some more text
many lines of some more text
many lines of some more text
//
Iteration 2: var =
LOCUS 231 some more text
many lines of some more text
many lines of some more text
many lines of some more text
//
Iteration 3: var =
LOCUS 312 some more text
many lines of some more text
many lines of some more text
many lines of some more text
//
Thanks in advance, and all the best for the upcoming holidays!
Assume we have the following text file:
LOCUS 421 bla bla ba
Lorem ipsum dolor sit amet,
consectetur adipiscing elit.
Duis eu erat orci. Quisque
nec augue ultricies, dignissim
neque id, feugiat risus.
//
LOCUS 421 blabla
Nullam pulvinar quis ante
at condimentum.
//
We could do the following:
is_processing = True
with open("somefile.txt", "r") as pf:
    # Handles chunks
    while True:
        first_chunk_line = True
        chunk_lines = []
        # Handles one chunk
        while True:
            data_line = pf.readline()
            # Detect the end of the file
            if data_line == '':
                is_processing = False
                break
            # Detect the first line of a chunk
            if first_chunk_line:
                if "LOCUS" not in data_line:
                    raise Exception("Data file is malformed!")
                first_chunk_line = False
                continue  # don't process the LOCUS line itself
            # Detect the end of the locus / chunk
            if data_line.strip() == "//":
                break
            # If it is neither a first line, an end line, nor the end of the file,
            # then it must be a chunk line holding precious DNA information
            chunk_lines.append(data_line)
        # End the outer loop once the file is exhausted
        if not is_processing:
            break
        # Do something with the lines of one chunk
        print(chunk_lines)
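For a 20 GB file it can also be convenient to wrap the same chunking logic in a generator, so each entry is handed to the caller one at a time and memory use stays bounded by the size of a single entry. The following is a minimal sketch under the same assumptions as above (every entry opens with a LOCUS line and closes with a line containing only //); the function name read_locus_chunks and the file name somefile.txt are placeholders chosen for this example.

def read_locus_chunks(path):
    """Yield the lines of each LOCUS ... // entry, one entry at a time."""
    chunk_lines = []
    with open(path, "r") as handle:
        for line in handle:
            if line.startswith("LOCUS"):
                # Start of a new entry
                chunk_lines = [line]
            elif line.strip() == "//":
                # End of the current entry: hand it to the caller, then reset
                chunk_lines.append(line)
                yield chunk_lines
                chunk_lines = []
            elif chunk_lines:
                # A body line belonging to the current entry
                chunk_lines.append(line)

# Usage: each var holds one complete entry, including the LOCUS and // lines
for var in read_locus_chunks("somefile.txt"):
    print(var[0])  # e.g. inspect the LOCUS header of each entry

Unlike the loop above, this variant keeps the LOCUS and // lines inside each yielded chunk, which matches the iteration-1/2/3 output described in the question.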