How do I get the position of characters in a string according to the previous line/string in Python?
I have to parse files containing data that I need to put/sort into a pandas DataFrame. Below is an example of part of a file I parse:
TEST#  RESULT         UNITS    LOWER          UPPER          ALARM TEST NAME
------ -------------- -------- -------------- -------------- ----- ----------
1.0    1234                                                        TESTNAME1
1.1    H647333                                                     TESTNAME2
1.2    30             C                                            TEMPTOTAL
1.3    1              cnt                                          CEREAL
1.4    364003         cnt                                          POINTNUM
1.5    20200505       cnt                                          Date
1.6    174143         cnt                                          Time
1.7    2.020051e+007  cnt                                          DateTime
1.8    123            cnt                                          SMT
1.9    23.16          C                                            TEMP1
1.10   23.55          C                       123                  TEMP2
1.11   22.88          C        -23                                 TEMP3
1.12   22.86          C                                            TEMP4
1.13   1.406          Meter    -1.450         1.500                DIST1
1.14   0.718          Meter    -0.800         0.350          FAIL  DIST2
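My question is: how do I account for a row that has a lower limit but no upper limit, or an upper limit but no lower limit?
NOTE: My actual text files don't contain this case, but my application/project calls for handling instances where it could occur.
Here is how I currently check each line: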
line = file_object.readline()
while line.strip():
    # extract data from line and format all info in one list
    xline = line.strip().split()
    # the length of the info list of the line read
    # is correlated to the data
    if len(xline) == 3:
        number = xline[0]
        results = xline[1]
        testname = xline[2]
        units = None
        lower = None
        upper = None
        # alarm = None
    elif len(xline) == 4:
        number = xline[0]
        results = xline[1]
        units = xline[2]
        testname = xline[3]
        lower = None
        upper = None
        # alarm = None
    elif len(xline) == 6:
        number = xline[0]
        results = xline[1]
        units = xline[2]
        lower = xline[3]
        upper = xline[4]
        testname = xline[5]
        # alarm = None
    elif len(xline) == 7:
        number = xline[0]
        results = xline[1]
        units = xline[2]
        lower = xline[3]
        upper = xline[4]
        # alarm = xline[5]
        testname = xline[6]
    # create a dictionary containing this row of data
    row = {
        'Test #': number,
        'Result': results,
        'Units': units,
        'Lower Limit': lower,
        'Upper Limit': upper,
        # 'Alarm': alarm,
        'Test Name': testname,
    }
    data.append(row)
    line = file_object.readline()
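My idea is to compare each data line that is read against the column positions in the header line "TEST# RESULT UNITS LOWER UPPER ALARM TEST NAME", but I don't know how to do that. If anyone could point me in a workable direction, that would be great!
EDIT: The files are not solely in the table format shown above. My files have a whole bunch of staggered block text at the start, as well as multiple "tables" with staggered block text between them.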
You can use pd.read_fwf:
import pandas as pd
df = pd.read_fwf(inputtxt, colspecs='infer')
Output:
TEST# RESULT UNITS LOWER UPPER ALARM TEST NAME
0 ------ -------------- -------- -------------- -------------- ----- ----------
1 1.0 1234 NaN NaN NaN NaN TESTNAME1
2 1.1 H647333 NaN NaN NaN NaN TESTNAME2
3 1.2 30 C NaN NaN NaN TEMPTOTAL
4 1.3 1 cnt NaN NaN NaN CEREAL
5 1.4 364003 cnt NaN NaN NaN POINTNUM
6 1.5 20200505 cnt NaN NaN NaN Date
7 1.6 174143 cnt NaN NaN NaN Time
8 1.7 2.020051e+007 cnt NaN NaN NaN DateTime
9 1.8 123 cnt NaN NaN NaN SMT
10 1.9 23.16 C NaN NaN NaN TEMP1
11 1.10 23.55 C NaN 123 NaN TEMP2
12 1.11 22.88 C -23 NaN NaN TEMP3
13 1.12 22.86 C NaN NaN NaN TEMP4
14 1.13 1.406 Meter -1.450 1.500 NaN DIST1
15 1.14 0.718 Meter -0.800 0.350 FAIL DIST2
You can then drop index 0 to get rid of the dashed line:
df = df.drop(0)
Output:
TEST# RESULT UNITS LOWER UPPER ALARM TEST NAME
1 1.0 1234 NaN NaN NaN NaN TESTNAME1
2 1.1 H647333 NaN NaN NaN NaN TESTNAME2
3 1.2 30 C NaN NaN NaN TEMPTOTAL
4 1.3 1 cnt NaN NaN NaN CEREAL
5 1.4 364003 cnt NaN NaN NaN POINTNUM
6 1.5 20200505 cnt NaN NaN NaN Date
7 1.6 174143 cnt NaN NaN NaN Time
8 1.7 2.020051e+007 cnt NaN NaN NaN DateTime
9 1.8 123 cnt NaN NaN NaN SMT
10 1.9 23.16 C NaN NaN NaN TEMP1
11 1.10 23.55 C NaN 123 NaN TEMP2
12 1.11 22.88 C -23 NaN NaN TEMP3
13 1.12 22.86 C NaN NaN NaN TEMP4
14 1.13 1.406 Meter -1.450 1.500 NaN DIST1
15 1.14 0.718 Meter -0.800 0.350 FAIL DIST2
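As a variation on the above, the dashed line can be skipped while reading instead of dropped afterwards. The following is only a sketch under a few assumptions: the sample text is rebuilt in memory from a handful of the question's rows, the names `WIDTHS`, `ROWS` and `fixed_width` are made up for the illustration, and the column boundaries are taken from the dashed line rather than inferred. Reading with `dtype=str` is a deliberate choice here, because once the dashed row is gone pandas would otherwise parse TEST# as a float and "1.1" and "1.10" would become the same value.

```python
import io
import re

import pandas as pd

# Hypothetical sample rebuilt from a few rows of the question's table.
WIDTHS = (6, 14, 8, 14, 14, 5, 10)
HEADER = ("TEST#", "RESULT", "UNITS", "LOWER", "UPPER", "ALARM", "TEST NAME")
ROWS = [
    ("1.0", "1234", "", "", "", "", "TESTNAME1"),
    ("1.10", "23.55", "C", "", "123", "", "TEMP2"),
    ("1.11", "22.88", "C", "-23", "", "", "TEMP3"),
    ("1.14", "0.718", "Meter", "-0.800", "0.350", "FAIL", "DIST2"),
]


def fixed_width(cells):
    """Left-align each cell in its column and join the columns with one space."""
    return " ".join(cell.ljust(width) for cell, width in zip(cells, WIDTHS)).rstrip()


dashes = " ".join("-" * width for width in WIDTHS)
text = "\n".join([fixed_width(HEADER), dashes] + [fixed_width(r) for r in ROWS]) + "\n"

# Column boundaries come from the dashed line, not from the data rows.
colspecs = [m.span() for m in re.finditer(r"-+", dashes)]

df = pd.read_fwf(
    io.StringIO(text),
    colspecs=colspecs,
    skiprows=[1],  # line 0 is the header, line 1 is the dashed line
    dtype=str,     # keep "1.1" and "1.10" distinct
)

# Only the limit columns are converted; RESULT holds values like "H647333".
for col in ("LOWER", "UPPER"):
    df[col] = pd.to_numeric(df[col])
```

With this layout a missing lower or upper limit simply shows up as NaN in its own column, so `df["LOWER"].isna()` and `df["UPPER"].isna()` can be checked independently.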
Here is a non-pandas solution that infers the field widths from the dashed line under the header (but do use pandas):
import re

with open('table.txt') as fin:
    next(fin)  # skip headers
    # capture start/end of each set of dashed lines to get field widths
    spans = [m.span() for m in re.finditer(r'-+', next(fin))]
    for line in fin:
        # break lines on the field widths and strip leading/trailing white space
        column = [line[start:end].strip() for start, end in spans]
        print(column)
Output:
['1.0', '1234', '', '', '', '', 'TESTNAME1']
['1.1', 'H647333', '', '', '', '', 'TESTNAME2']
['1.2', '30', 'C', '', '', '', 'TEMPTOTAL']
['1.3', '1', 'cnt', '', '', '', 'CEREAL']
['1.4', '364003', 'cnt', '', '', '', 'POINTNUM']
['1.5', '20200505', 'cnt', '', '', '', 'Date']
['1.6', '174143', 'cnt', '', '', '', 'Time']
['1.7', '2.020051e+007', 'cnt', '', '', '', 'DateTime']
['1.8', '123', 'cnt', '', '', '', 'SMT']
['1.9', '23.16', 'C', '', '', '', 'TEMP1']
['1.10', '23.55', 'C', '', '123', '', 'TEMP2']
['1.11', '22.88', 'C', '-23', '', '', 'TEMP3']
['1.12', '22.86', 'C', '', '', '', 'TEMP4']
['1.13', '1.406', 'Meter', '-1.450', '1.500', '', 'DIST1']
['1.14', '0.718', 'Meter', '-0.800', '0.350', 'FAIL', 'DIST2']
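The same idea can be stretched to the case from the question's EDIT, where several tables are mixed in with free text. The sketch below is one possible way to do it and rests on assumptions that are not in the original: a table starts at a line made up only of dash groups, the header is the line directly above it, and the table ends at the first blank line (mirroring the `while line.strip()` loop in the question). `parse_tables`, `fmt` and the shortened sample are hypothetical names and data made up for the illustration.

```python
import re

DASHES = re.compile(r"-+")
# A separator line consists of two or more dash groups and nothing else.
SEPARATOR = re.compile(r"^\s*-+(?:\s+-+)+\s*$")


def parse_tables(lines):
    """Return a list of tables, each a list of {column name: value} dicts."""
    tables = []
    previous = ""   # last line seen outside a table (candidate header)
    spans = None    # column boundaries while inside a table, else None
    names = rows = None
    for raw in lines:
        line = raw.rstrip("\n")
        if spans is None:
            if SEPARATOR.match(line):
                spans = [m.span() for m in DASHES.finditer(line)]
                names = [previous[start:end].strip() for start, end in spans]
                rows = []
            previous = line
        elif line.strip():
            # Slice on the column boundaries; empty fields become None.
            cells = [line[start:end].strip() or None for start, end in spans]
            rows.append(dict(zip(names, cells)))
        else:
            # A blank line closes the current table.
            tables.append(rows)
            spans = None
            previous = line
    if spans is not None:
        tables.append(rows)
    return tables


def fmt(*cells):
    """Lay out five cells in fixed-width columns (sample data only)."""
    return "{:<6} {:<7} {:<7} {:<7} {}".format(*cells).rstrip()


header = fmt("TEST#", "RESULT", "LOWER", "UPPER", "TEST NAME")
rule = fmt("-" * 6, "-" * 7, "-" * 7, "-" * 7, "-" * 10)
sample = [
    "Report text that is not part of any table.",
    "",
    header,
    rule,
    fmt("1.10", "23.55", "", "123", "TEMP2"),
    fmt("1.11", "22.88", "-23", "", "TEMP3"),
    "",
    "More free text between the tables.",
    "",
    header,
    rule,
    fmt("1.13", "1.406", "-1.450", "1.500", "DIST1"),
]

tables = parse_tables(sample)
for table in tables:
    for row in table:
        print(row)
```

Each inner list can be handed to `pd.DataFrame(...)` if a DataFrame per table is wanted. Passing an open file object instead of `sample` should work as well, since the function only iterates over lines.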