Python function to read variable length blocks of data from file while open
My data file contains data for many timesteps, with each timestep formatted like this:
TIMESTEP PARTICLES
0.00500103 1262
ID GROUP VOLUME MASS PX PY PZ VX VY VZ
651 0 5.23599e-07 0.000397935 -0.084626 -0.0347849 0.00188164 0 0 -1.04903
430 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903
384 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903
971 0 5.23599e-07 0.000397935 -0.0954931 -0.0159607 0.0100155 0 0 -1.04903
....
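Each block consists of 3 header lines followed by a number of data lines for that timestep (the count is the int on line 2). The number of data lines in a block can vary from 0 to 10 million. There may be a blank line between blocks, but sometimes it is missing.
I want to read the file block by block and process the data after each block is read. The files are large (often over 200GB) and one timestep is about as much as fits comfortably in memory.
Given the file format, I thought it would be easy to write a function that reads the 3 header lines, reads the actual data, and then returns a nice numpy array for processing.
I am used to MATLAB, where you can simply read blocks while not at the end of the file. I'm not quite sure how to do this in Python.
I created the following function to read a block of data: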
import numpy as np

def readBlock(f):
    particleData = []
    Timestep = []
    numParticles = []
    linesProcessed = 0
    line = f.readline().strip()
    if line.startswith('TIMESTEP'):
        timestepHeaders = line.strip()
        varData = f.readline().strip()
        headerStrings = f.readline().strip().split(' ')
        parts = varData.strip().split(' ')
        Timestep = float(parts[0])
        numParticles = int(parts[1])
        while linesProcessed < numParticles:
            particleData.append(tuple(f.readline().strip().split(' ')))
            linesProcessed += 1
        mydt = np.dtype([ ('ID', int),
                          ('GROUP', int),
                          ('Vol', float),
                          ('Mass', float),
                          ('Px', float),
                          ('Py', float),
                          ('Pz', float),
                          ('Vx', float),
                          ('Vy', float),
                          ('Vz', float),
                        ] )
        particleData = np.array(particleData, dtype=mydt)
    return Timestep, numParticles, particleData
I tried running the function like this:
import time

with open(fileOpenPath, 'r') as file:
    startWallTime = time.perf_counter()  # time.clock() was removed in Python 3.8
    Timestep, numParticles, particleData = readBlock(file)
    print(Timestep)
    ## Do processing stuff here
    print("Timestep Processed")
    endWallTime = time.perf_counter()
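The problem is that this only reads the first block of data from the file and stops there. I don't know how to make it loop through the file until it reaches the end and then stop.
Any advice on how to make this work would be great. I think I could write something that processes the file line by line, with lots of if checks to see whether I'm at the end of a timestep, but a simple function seemed easier and clearer.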
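with does not loop; it only ensures the file is closed properly afterwards.
To loop, you need to add a while inside the with statement (see the code below). Before doing that, though, the readBlock(f) function needs to check for end of file (EOF). Replace line = f.readline().strip() with the following code: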
line = f.readline()
if not line:
    # EOF: returning None's.
    return None, None, None
# We do the strip after the check.
# Otherwise a blank line "\n" might be interpreted as EOF.
line = line.strip()
Then add a while loop inside the with block and check whether None was returned, which indicates EOF, so we can break out of the loop:
import time

with open('file1') as file_handle:
    while True:
        startWallTime = time.perf_counter()  # time.clock() was removed in Python 3.8
        Timestep, numParticles, particleData = readBlock(file_handle)
        if Timestep is None:
            # readBlock returns None at EOF
            break
        print(Timestep)
        ## Do processing stuff here
        print("Timestep Processed")
        endWallTime = time.perf_counter()
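The loop-until-EOF idea can also be written as a generator, so the caller just iterates over blocks. The following is only a sketch under some assumptions: the name iter_blocks, the PARTICLE_DTYPE layout and the miniature sample (including its second and third timestep values) are made up for illustration and are not part of the question. It skips an optional blank line between blocks and treats a zero-particle block explicitly.

```python
import io

import numpy as np

# Field names follow the column header line shown in the question.
PARTICLE_DTYPE = np.dtype([
    ("ID", int), ("GROUP", int), ("VOLUME", float), ("MASS", float),
    ("PX", float), ("PY", float), ("PZ", float),
    ("VX", float), ("VY", float), ("VZ", float),
])


def iter_blocks(f):
    """Yield (timestep, particle_data) for every block in an open text file."""
    while True:
        line = f.readline()
        if not line:              # "" means end of file
            return
        if not line.strip():      # optional blank line between blocks
            continue
        if not line.startswith("TIMESTEP"):
            raise ValueError("expected a TIMESTEP line, got %r" % line)
        timestep, count = f.readline().split()
        count = int(count)
        f.readline()              # column names; PARTICLE_DTYPE supplies them
        rows = [f.readline() for _ in range(count)]
        if count == 0:
            data = np.empty(0, dtype=PARTICLE_DTYPE)
        else:
            # atleast_1d keeps a one-row block as a 1-D array
            data = np.atleast_1d(np.genfromtxt(rows, dtype=PARTICLE_DTYPE))
        yield float(timestep), data


# Hypothetical miniature file: one blank separator, one missing, one empty block.
sample = io.StringIO(
    "TIMESTEP PARTICLES\n"
    "0.00500103 2\n"
    "ID GROUP VOLUME MASS PX PY PZ VX VY VZ\n"
    "651 0 5.23599e-07 0.000397935 -0.084626 -0.0347849 0.00188164 0 0 -1.04903\n"
    "430 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903\n"
    "\n"
    "TIMESTEP PARTICLES\n"
    "0.0100021 0\n"
    "ID GROUP VOLUME MASS PX PY PZ VX VY VZ\n"
    "TIMESTEP PARTICLES\n"
    "0.0150031 1\n"
    "ID GROUP VOLUME MASS PX PY PZ VX VY VZ\n"
    "384 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903\n"
)

blocks = list(iter_blocks(sample))
for timestep, data in blocks:
    print(timestep, len(data))
```

With a real file you would pass the handle from with open(...) instead of the StringIO object and process each block inside the for loop rather than collecting them in a list, so only one timestep is held in memory at a time.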
Here's a simple test (it worked on the second try!):
import numpy as np

with open('stack41091659.txt','rb') as f:
    while f.readline():   # read the 'TIMESTEP PARTICLES' line
        time, n = f.readline().strip().split()
        n = int(n)
        print(time, n)
        ablock = [f.readline()]  # block header line
        for i in range(n):
            ablock.append(f.readline())
        print(len(ablock))
        data = np.genfromtxt(ablock, dtype=None, names=True)
        print(data.shape, data.dtype)
Test run:
1458:~/mypy$ python3 stack41091659.py
b'0.00500103' 4
5
(4,) [('ID', '<i4'), ('GROUP', '<i4'), ('VOLUME', '<f8'), ('MASS', '<f8'), ('PX', '<f8'), ('PY', '<f8'), ('PZ', '<f8'), ('VX', '<i4'), ('VY', '<i4'), ('VZ', '<f8')]
b'0.00500103' 3
4
(3,) [('ID', '<i4'), ('GROUP', '<i4'), ('VOLUME', '<f8'), ('MASS', '<f8'), ('PX', '<f8'), ('PY', '<f8'), ('PZ', '<f8'), ('VX', '<i4'), ('VY', '<i4'), ('VZ', '<f8')]
b'0.00500103' 2
3
(2,) [('ID', '<i4'), ('GROUP', '<i4'), ('VOLUME', '<f8'), ('MASS', '<f8'), ('PX', '<f8'), ('PY', '<f8'), ('PZ', '<f8'), ('VX', '<i4'), ('VY', '<i4'), ('VZ', '<f8')]
b'0.00500103' 4
5
(4,) [('ID', '<i4'), ('GROUP', '<i4'), ('VOLUME', '<f8'), ('MASS', '<f8'), ('PX', '<f8'), ('PY', '<f8'), ('PZ', '<f8'), ('VX', '<i4'), ('VY', '<i4'), ('VZ', '<f8')]
Sample file:
TIMESTEP PARTICLES
0.00500103 4
ID GROUP VOLUME MASS PX PY PZ VX VY VZ
651 0 5.23599e-07 0.000397935 -0.084626 -0.0347849 0.00188164 0 0 -1.04903
430 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903
384 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903
971 0 5.23599e-07 0.000397935 -0.0954931 -0.0159607 0.0100155 0 0 -1.04903
TIMESTEP PARTICLES
0.00500103 3
ID GROUP VOLUME MASS PX PY PZ VX VY VZ
651 0 5.23599e-07 0.000397935 -0.084626 -0.0347849 0.00188164 0 0 -1.04903
430 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903
384 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903
TIMESTEP PARTICLES
0.00500103 2
ID GROUP VOLUME MASS PX PY PZ VX VY VZ
384 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903
971 0 5.23599e-07 0.000397935 -0.0954931 -0.0159607 0.0100155 0 0 -1.04903
TIMESTEP PARTICLES
0.00500103 4
ID GROUP VOLUME MASS PX PY PZ VX VY VZ
651 0 5.23599e-07 0.000397935 -0.084626 -0.0347849 0.00188164 0 0 -1.04903
430 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903
384 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903
971 0 5.23599e-07 0.000397935 -0.0954931 -0.0159607 0.0100155 0 0 -1.04903
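I'm making use of the fact that genfromtxt is happy with anything that feeds it lines. Here I collect the next block in a list and pass it to genfromtxt.
And using the max_rows parameter of genfromtxt, I can tell it to read the next n rows directly: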
with open('stack41091659.txt','rb') as f:
    while f.readline():
        time, n = f.readline().strip().split()
        n = int(n)
        print(time, n)
        data = np.genfromtxt(f, dtype=None, names=True, max_rows=n)
        print(data.shape, len(data.dtype.names))
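I haven't taken that optional blank line into account. It could probably be handled at the start of the block read, i.e. read lines until I get a valid float int pair of strings.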
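One way to read "read lines until I get a valid float int pair" is a small helper that discards every line that does not parse that way. This is a sketch, not the answerer's code: the name next_float_int is hypothetical, and it assumes the helper is only called at a block boundary, where the only lines before the counts are blank lines and the TIMESTEP PARTICLES label.

```python
import io


def next_float_int(f):
    """Return (timestep, count) from the next line that parses as 'float int'.

    Returns None at end of file. Blank lines and the 'TIMESTEP PARTICLES'
    label are skipped because they do not parse as a float followed by an int.
    """
    while True:
        line = f.readline()
        if not line:                  # empty result: end of file
            return None
        fields = line.split()
        if len(fields) != 2:          # blank line or something else entirely
            continue
        try:
            return float(fields[0]), int(fields[1])
        except ValueError:            # e.g. the 'TIMESTEP PARTICLES' label
            continue


sample = io.StringIO(
    "\n"
    "TIMESTEP PARTICLES\n"
    "0.00500103 4\n"
    "ID GROUP VOLUME MASS PX PY PZ VX VY VZ\n"
)
first = next_float_int(sample)    # skips the blank line and the label line
header = sample.readline()        # the file is now at the column-name line
last = next_float_int(sample)     # nothing left that parses, so None
print(first, header.split()[:2], last)
```

After a successful call the file is positioned at the column-name line, which is where the genfromtxt call with names=True and max_rows=n expects to start.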
You can use the max_rows parameter of numpy.genfromtxt:
import numpy as np

with open("timesteps.dat", "rb") as f:
    while True:
        line = f.readline()
        # Skip blank lines; readline() returns b"" only at end of file,
        # so trailing blank lines cannot trap us in this loop
        while line and not line.strip():
            line = f.readline()
        if not line:
            # End of file
            break
        line2_fields = f.readline().split()
        timestep = float(line2_fields[0])
        particles = int(line2_fields[1])
        data = np.genfromtxt(f, names=True, dtype=None, max_rows=particles)
        print("Timestep:", timestep)
        print("Particles:", particles)
        print("Data:")
        print(data)
        print()
Here is a sample file:
TIMESTEP PARTICLES
0.00500103 4
ID GROUP VOLUME MASS PX PY PZ VX VY VZ
651 0 5.23599e-07 0.000397935 -0.084626 -0.0347849 0.00188164 0 0 -1.04903
430 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903
384 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903
971 0 5.23599e-07 0.000397935 -0.0954931 -0.0159607 0.0100155 0 0 -1.04903
TIMESTEP PARTICLES
0.00500103 5
ID GROUP VOLUME MASS PX PY PZ VX VY VZ
971 0 5.23599e-07 0.000397935 -0.0954931 -0.0159607 0.0100155 0 0 -1.04903
652 0 5.23599e-07 0.000397935 -0.084626 -0.0347849 0.00188164 0 0 -1.04903
431 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903
385 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903
972 0 5.23599e-07 0.000397935 -0.0954931 -0.0159607 0.0100155 0 0 -1.04903
TIMESTEP PARTICLES
0.00500103 3
ID GROUP VOLUME MASS PX PY PZ VX VY VZ
222 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903
333 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903
444 0 5.23599e-07 0.000397935 -0.0954931 -0.0159607 0.0100155 0 0 -1.04903
Here is the output:
Timestep: 0.00500103
Particles: 4
Data:
[ (651, 0, 5.23599e-07, 0.000397935, -0.084626, -0.0347849, 0.00188164, 0, 0, -1.04903)
(430, 0, 5.23599e-07, 0.000397935, -0.0837742, -0.0442293, 0.0121046, 0, 0, -1.04903)
(384, 0, 5.23599e-07, 0.000397935, -0.0749234, -0.0395652, 0.0143401, 0, 0, -1.04903)
(971, 0, 5.23599e-07, 0.000397935, -0.0954931, -0.0159607, 0.0100155, 0, 0, -1.04903)]
Timestep: 0.00500103
Particles: 5
Data:
[ (971, 0, 5.23599e-07, 0.000397935, -0.0954931, -0.0159607, 0.0100155, 0, 0, -1.04903)
(652, 0, 5.23599e-07, 0.000397935, -0.084626, -0.0347849, 0.00188164, 0, 0, -1.04903)
(431, 0, 5.23599e-07, 0.000397935, -0.0837742, -0.0442293, 0.0121046, 0, 0, -1.04903)
(385, 0, 5.23599e-07, 0.000397935, -0.0749234, -0.0395652, 0.0143401, 0, 0, -1.04903)
(972, 0, 5.23599e-07, 0.000397935, -0.0954931, -0.0159607, 0.0100155, 0, 0, -1.04903)]
Timestep: 0.00500103
Particles: 3
Data:
[ (222, 0, 5.23599e-07, 0.000397935, -0.0837742, -0.0442293, 0.0121046, 0, 0, -1.04903)
(333, 0, 5.23599e-07, 0.000397935, -0.0749234, -0.0395652, 0.0143401, 0, 0, -1.04903)
(444, 0, 5.23599e-07, 0.000397935, -0.0954931, -0.0159607, 0.0100155, 0, 0, -1.04903)]
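The records printed above are rows of a NumPy structured array, so individual columns can be pulled out by field name for processing. A small sketch, assuming the field names and types that genfromtxt reported in the earlier test run; the array is built by hand here from two of the records above instead of being read from a file.

```python
import numpy as np

# Two records copied from the output above; field names follow the header line.
data = np.array(
    [
        (222, 0, 5.23599e-07, 0.000397935, -0.0837742, -0.0442293, 0.0121046, 0, 0, -1.04903),
        (333, 0, 5.23599e-07, 0.000397935, -0.0749234, -0.0395652, 0.0143401, 0, 0, -1.04903),
    ],
    dtype=[("ID", "i4"), ("GROUP", "i4"), ("VOLUME", "f8"), ("MASS", "f8"),
           ("PX", "f8"), ("PY", "f8"), ("PZ", "f8"),
           ("VX", "i4"), ("VY", "i4"), ("VZ", "f8")],
)

ids = data["ID"]                   # one column, as a plain integer array
total_mass = data["MASS"].sum()    # reduce over a single field
# Stack three fields into an (n, 3) float array of positions
positions = np.column_stack([data["PX"], data["PY"], data["PZ"]])
print(ids, total_mass, positions.shape)
```

Indexing by name returns a view onto the block's data, so per-column reductions like the sum above do not copy the whole timestep.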