Use Python and Pandas to split data in a text file
I have the following data from a CFD simulation:
Average value for X = 0.5080000265E-0003 to 0.2489200234E-0001
Z = -.3141592741E+0001
Time = 0.7000032425E+0001
Y P_g
0.1511904760E-0002 0.2565604063E+0006
0.4535714164E-0002 0.2565349844E+0006
0.7559523918E-0002 0.2565098906E+0006
0.1058333274E-0001 0.2564848125E+0006
0.1360714249E-0001 0.2564597656E+0006
0.1663095318E-0001 0.2564346563E+0006
0.1965476200E-0001 0.2564095625E+0006
... ...
... ...
0.1259419441E+0001 0.2549983125E+0006
0.1262443304E+0001 0.2549983125E+0006
0.1265467167E+0001 0.2549983125E+0006
0.1268491030E+0001 0.2549982656E+0006
Time = 0.7010014057E+0001
Y P_g
0.1511904760E-0002 0.2565604063E+0006
0.4535714164E-0002 0.2565349844E+0006
0.7559523918E-0002 0.2565098906E+0006
0.1058333274E-0001 0.2564848125E+0006
... ...
... ...
0.1259419441E+0001 0.2549983125E+0006
0.1262443304E+0001 0.2549983125E+0006
0.1265467167E+0001 0.2549983125E+0006
0.1268491030E+0001 0.2549982656E+0006
Time = 0.7020006657E+0001
Y P_g
0.1511904760E-0002 0.2565604063E+0006
0.1058333274E-0001 0.2564848125E+0006
... ...
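As the example above shows, the data is split into vertical sections, one per time step, each introduced by a Time line. Within each section Y does not change, but P_g does. To plot the data, I need the P_g values of each section placed side by side as columns. For example, this is how I need to restructure the data: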
Y 0.7000032425E+1 0.7020006657E+1 ...
0.1511904760E-0002 0.2565604063E+0006 0.2549982656E+0006 ...
0.4535714164E-0002 0.2565349844E+0006 0.2549982656E+0006 ...
0.7559523918E-0002 0.2565098906E+0006 0.2549982656E+0006 ...
0.1058333274E-0001 0.2564848125E+0006 0.2549982656E+0006 ...
0.1360714249E-0001 0.2564597656E+0006 0.2549982656E+0006 ...
Using Pandas, I can read the data from the text file and create a new data frame with the Y values as the index (rows) and the Time values as the columns:
import pandas as pd
# Read in data from text file
# -------------------------------------------------------------------------
# data frame from text file contents, skip first 4 rows, separate by variable
# white space, no header
df = pd.read_table('ROP_s_SD.dat', skiprows=4, sep=r'\s+', header=None)
# Time data
# -------------------------------------------------------------------------
# data frame of the rows that contain the Time string
dftime = df.loc[df[0].str.contains('Time')]
t = dftime[2].tolist() # time list
idx = dftime.index # index of rows containing Time string
# Y data
# -------------------------------------------------------------------------
# grab values for y to create index for new data frame
ido = idx[0]+2 # index of first y value
idf = idx[1] # index just past the last y value (the next Time row)
y = [] # empty list to store y values
for i in range(ido, idf): # iterate through first section of y values
    v = df.loc[i, 0] # get y value from data frame (.ix is no longer available)
    y.append(float(v)) # add y value to y list
# New data frame
# ------------------------------------------------------------------------
# empty data frame with y as index and t as columns
dfnew = pd.DataFrame(None, index=y, columns=t)
print('dfnew is \n', dfnew.head())
The head of the empty data frame, dfnew.head(), looks like this:
7.000032 7.010014 7.020007 7.030043 7.040020 7.050035 7.060043
0.001512 NaN NaN NaN NaN NaN NaN NaN
0.004536 NaN NaN NaN NaN NaN NaN NaN
0.007560 NaN NaN NaN NaN NaN NaN NaN
0.010583 NaN NaN NaN NaN NaN NaN NaN
0.013607 NaN NaN NaN NaN NaN NaN NaN
7.070004 7.080036 7.090022 ... 7.650011 7.660032 7.670026
0.001512 NaN NaN NaN ... NaN NaN NaN
0.004536 NaN NaN NaN ... NaN NaN NaN
0.007560 NaN NaN NaN ... NaN NaN NaN
0.010583 NaN NaN NaN ... NaN NaN NaN
0.013607 NaN NaN NaN ... NaN NaN NaN
7.680044 7.690029 7.700008 7.710012 7.720014 7.730019 7.740026
0.001512 NaN NaN NaN NaN NaN NaN NaN
0.004536 NaN NaN NaN NaN NaN NaN NaN
0.007560 NaN NaN NaN NaN NaN NaN NaN
0.010583 NaN NaN NaN NaN NaN NaN NaN
0.013607 NaN NaN NaN NaN NaN NaN NaN
[5 rows x 75 columns]
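The NaN values in each column should be replaced by the P_g values from that particular Time section. How can I add each section's P_g values to their respective columns?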
The text file I am reading can be downloaded here.
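Two things. First, consider how you might reduce this to a two-dimensional spreadsheet. Which columns should each row contain? I suggest that each row contain Time, Y, and P_g. That may inform your strategy for handling the funky input format.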
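Second, are you trying to plot values of P_g vs. Y vs. Time? Your data appears to have 3 variables; you need to reduce it to 2 dimensions to draw a 2D plot. Do you want to plot the mean of P_g for specific Time values? Or do you want a 3D plot, where you plot Y vs. P_g for each Time value? Assuming you adopt the row/column structure I suggested above, either is easy to do with pandas. Take a look at the pandas groupby functionality. Here's more detail on that.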
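As a rough illustration of the groupby suggestion, the sketch below assumes the data has already been flattened into one row per (Time, Y, P_g). The small `long_df` frame and its values are made up for the example and are not read from the real file.

```python
import pandas as pd

# Hypothetical long-format data: one row per (Time, Y) pair.
long_df = pd.DataFrame({
    'Time': [7.00, 7.00, 7.01, 7.01],
    'Y':    [0.0015, 0.0045, 0.0015, 0.0045],
    'P_g':  [256560.0, 256535.0, 256500.0, 256480.0],
})

# Mean of P_g within each Time section.
mean_by_time = long_df.groupby('Time')['P_g'].mean()
print(mean_by_time)

# Each group is an ordinary DataFrame, so it can be plotted or inspected on its own.
for time_value, section in long_df.groupby('Time'):
    print(time_value, len(section))
```

The same pattern should carry over to other per-section summaries, such as `.std()` or `.max()`.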
Edit: You've clarified my two questions. Try this:
import io
import pandas
# read the whole file and split it into one chunk per "Time = " marker
with open('ROP_s_SD.dat', 'r') as f:
    text = f.read()
chunks = text.split("Time = ")
# ignore first chunk (the file header before the first Time marker)
chunks = chunks[1:]
frames = []
for chunk in chunks:
    # the first line of a chunk is the time value, the rest is the Y/P_g table
    time_str, rest_str = chunk.split('\n', 1)
    time = float(time_str)
    # DataFrame.from_csv no longer exists in current pandas; read_csv replaces it
    chunk_df = pandas.read_csv(io.StringIO(rest_str), sep=r'\s+')
    chunk_df['Time'] = time
    frames.append(chunk_df)
# DataFrame.append was removed in pandas 2.0; concatenate all chunks at once
df = pandas.concat(frames, ignore_index=True)
# you should now have a DataFrame with columns 'Time','Y','P_g'
assert sorted(df.columns) == ['P_g', 'Time', 'Y']
# iterate over unique values of time
times = sorted(set(df['Time']))
assert len(times) == len(chunks)
for i, time in enumerate(times):
    chunk_data = df[df['Time'] == time]
    # plot or do whatever you'd like with each segment
    means = chunk_data.mean()
    stds = chunk_data.std(ddof=0)  # ddof=0 matches numpy.std's default
    print('Data for time %d (%0.4f): ' % (i, time))
    print(means, stds)
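If the goal is the wide table from the question (Y down the rows, one column per Time), a long-format frame with Time, Y and P_g columns can probably be reshaped with `pivot`. This is a minimal sketch, not part of the original answer; `long_df` is a hypothetical stand-in with invented values.

```python
import pandas as pd

# Hypothetical stand-in for a long-format frame with Time, Y and P_g columns.
long_df = pd.DataFrame({
    'Time': [7.00, 7.00, 7.01, 7.01],
    'Y':    [0.0015, 0.0045, 0.0015, 0.0045],
    'P_g':  [256560.0, 256535.0, 256500.0, 256480.0],
})

# Rows become Y values, columns become Time values, cells hold P_g.
wide = long_df.pivot(index='Y', columns='Time', values='P_g')
print(wide)
```

Note that `pivot` raises an error if the same (Y, Time) pair appears more than once, so this assumes each Y value occurs exactly once per section.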
It looks like you've already done most of the hard work... The following lines will finish parsing your DataFrame:
# Add one more element to idx for correct indexing on the last column
idx = list(idx)
idx.append(len(df))
# Loop over the idx locations to fill the columns
for i in range(len(dfnew.columns)):
    dfnew.iloc[:, i] = df.iloc[idx[i]+2:idx[i+1], 1].values
The head of dfnew now looks like this for the first 3 columns:
7.000032 7.010014 7.020007
0.001512 0.2565604063E+0006 0.2565604063E+0006 0.2565604063E+0006
0.004536 0.2565349844E+0006 0.2565349844E+0006 0.2565349844E+0006
0.007560 0.2565098906E+0006 0.2565098906E+0006 0.2565098906E+0006
0.010583 0.2564848125E+0006 0.2564848125E+0006 0.2564848125E+0006
0.013607 0.2564597656E+0006 0.2564597656E+0006 0.2564597656E+0006
You have a lot of elements, so the best way to view the data is probably in 2D:
data = dfnew.astype(float).values
extent = [float(dfnew.columns[0]),
float(dfnew.columns[-1]),
float(dfnew.index[0]),
float(dfnew.index[-1])]
import matplotlib.pyplot as plt
plt.imshow(data, extent=extent, origin='lower')
plt.xlabel('Time')
plt.ylabel('Y')
By the way, it looks like all the P_g values are the same at every time in your sample file...
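One quick way to check an observation like this is to compare every column against the first one. The sketch below uses a small made-up frame, `wide`, in place of dfnew; the values are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for dfnew: Y as index, one column per Time value.
wide = pd.DataFrame(
    {7.00: [256560.0, 256535.0], 7.01: [256560.0, 256535.0]},
    index=[0.0015, 0.0045],
)

# True when every column holds the same values as the first column.
all_same = wide.eq(wide.iloc[:, 0], axis=0).all().all()
print(all_same)
```

On the real data, dfnew would first need converting with `astype(float)`, since the filled-in values are still strings.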