How can I group the lines in a text file based on the contents of each line?
Suppose I have a text file with the following content:
12277 17/06/2019 350 BJ201AB FMACRI
        0 J 52 4081.15 166851
        0 J 52 4496.64 166852
        0 J 52 5139.07 166855
        0 J 52 5773.82 166858
        J E 70 25 B159681
12509 21/06/2019 443 DH717WF BLANCO
        B J 42 5376.63 5164/A
12504 21/06/2019 443 EB631NF LUCCIG
        B J 44 5567.46 5165/A
        0 J 52 5347.58 166950
        0 J 52 4742.4 166953
        0 J 18 1146.24 427876
        0 J 4 0.4 427877
        J 0 372 1 B159763
        R 0 1567 1 B159764
Suppose I read the file like this:
with open('/home/pexp1/mezzi/INPUT') as f:
    lines = f.readlines()
data = [(line.rstrip()).split('\t') for line in lines]
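What is the correct way to group each line that starts with some content (an integer, a string, etc.) with every line below it, until the next line that follows the same rule is found?
Assuming I want to look up the lines that obey the rule and put everything else into their groups, which data structure is best for keeping these lines together?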
EDIT: Sorry for not being clear enough.
If I run the code above, this is what I get from print(data):
[
['12277', '17/06/2019', '350', 'BJ201AB', 'FMACRI'],
['', '', '', '', '', '0', 'J', '52', '4081.15', '166851'],
['', '', '', '', '', '0', 'J', '52', '4496.64', '166852'],
['', '', '', '', '', '0', 'J', '52', '5139.07', '166855'],
['', '', '', '', '', '0', 'J', '52', '5773.82', '166858'],
['', '', '', '', '', 'J', 'E', '70', '25', 'B159681'],
['12509', '21/06/2019', '443', 'DH717WF', 'BLANCO'],
['', '', '', '', '', 'B', 'J', '42', '5376.63', '5164/A'],
['12504', '21/06/2019', '443', 'EB631NF', 'LUCCIG'],
['', '', '', '', '', 'B', 'J', '44', '5567.46', '5165/A'],
...
]
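As you can see, this is a list of lists.
How can I group these items so that each list that contains an item at index position 0 (in this example 12277, 12509, etc.) is grouped with the lists that follow it (those with only empty strings at index positions 0, 1, 2, 3 and 4)?
Example:
['12277', '17/06/2019', '350', 'BJ201AB', 'FMACRI']
is grouped with
['', '', '', '', '', '0', 'J', '52', '4081.15', '166851'],
['', '', '', '', '', '0', 'J', '52', '4496.64', '166852']
and so on, until the next line that contains an element at index 0: ['12509', '21/06/2019', '443', 'DH717WF', 'BLANCO']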
EDIT2: This is the solution I came up with:
shipments = []
shuttle_lst = []
for line in data[1:]:
    if len(line[0]) < 1:
        shipments.append(line)
    else:
        shuttle = data[data.index(line) - (len(shipments) + 1)]
        shipments.append(shuttle)
        new_lst = [lst for lst in shipments]
        shuttle_lst.append(new_lst)
        shipments.clear()
This creates a list of lists in which each header ends up as the last element of its list.
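For comparison, here is a minimal sketch of the same idea applied to the tab-split rows, keeping each header as the first element of its group instead of the last. It assumes a header row is any row whose first field is non-empty, as in the print(data) output shown earlier. The function name `group_rows` and the shortened sample `rows` are illustrative, not part of the original code.

```python
def group_rows(rows):
    """Group each header row with the detail rows that follow it.

    Assumption: a header row has a non-empty first field, while a detail
    row starts with empty strings (produced by the leading tabs).
    """
    groups = []
    for row in rows:
        if row and row[0]:
            # Header row: start a new group with the header first.
            groups.append([row])
        elif groups:
            # Detail row: attach it to the most recent header.
            groups[-1].append(row)
        # Detail rows that appear before any header are skipped.
    return groups


# Shortened sample in the same shape as the print(data) output.
rows = [
    ['12277', '17/06/2019', '350', 'BJ201AB', 'FMACRI'],
    ['', '', '', '', '', '0', 'J', '52', '4081.15', '166851'],
    ['', '', '', '', '', 'J', 'E', '70', '25', 'B159681'],
    ['12509', '21/06/2019', '443', 'DH717WF', 'BLANCO'],
    ['', '', '', '', '', 'B', 'J', '42', '5376.63', '5164/A'],
]

groups = group_rows(rows)
for group in groups:
    print(group[0][0], "has", len(group) - 1, "detail rows")
```

Because a group is opened as soon as its header is seen, the last group in the file is kept without any extra step after the loop.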
If I understand correctly, you want to group by the header lines, meaning the lines that do not start with whitespace, right?
Consider the following:
import pprint
pp = pprint.PrettyPrinter(indent=4)
# A list of lists
data = []
with open('data.dat') as f:
    for line in f:
        if line.startswith(" ") or line.startswith("\t"):
            if not data:
                raise RuntimeError("Wrong data - first line is not legit")
            data[-1].append(line.split())
            continue
        # If here, this is a header line
        data.append([line.split()])
pp.pprint(data)
This prints:
[   [   ['12277', '17/06/2019', '350', 'BJ201AB', 'FMACRI'],
        ['0', 'J', '52', '4081.15', '166851'],
        ['0', 'J', '52', '4496.64', '166852'],
        ['0', 'J', '52', '5139.07', '166855'],
        ['0', 'J', '52', '5773.82', '166858'],
        ['J', 'E', '70', '25', 'B159681']],
    [   ['12509', '21/06/2019', '443', 'DH717WF', 'BLANCO'],
        ['B', 'J', '42', '5376.63', '5164/A']],
    [   ['12504', '21/06/2019', '443', 'EB631NF', 'LUCCIG'],
        ['B', 'J', '44', '5567.46', '5165/A'],
        ['0', 'J', '52', '5347.58', '166950'],
        ['0', 'J', '52', '4742.4', '166953'],
        ['0', 'J', '18', '1146.24', '427876'],
        ['0', 'J', '4', '0.4', '427877'],
        ['J', '0', '372', '1', 'B159763'],
        ['R', '0', '1567', '1', 'B159764']]]
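The result is a list of lists (of lists!). The first item of each second-level list is the header line, and the rest are the lines in that group.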
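If you need to look a group up by its header rather than by position, one option is to keep the header and its detail lines separate and build a dictionary on top. The following is only a sketch under assumptions: the function name `group_lines`, the in-memory `sample` lines, and the choice of the first header field as the key are all illustrative, and the key choice only works if that field is unique in your file.

```python
def group_lines(lines):
    """Return a list of (header_fields, detail_rows) tuples.

    Assumption: a header line starts with a non-whitespace character and
    every detail line starts with a space or a tab.
    """
    groups = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if line[0].isspace():
            if not groups:
                raise ValueError("detail line found before any header line")
            groups[-1][1].append(fields)
        else:
            groups.append((fields, []))
    return groups


# Illustrative sample; in practice, pass the open file object instead.
sample = [
    "12277\t17/06/2019\t350\tBJ201AB\tFMACRI\n",
    "\t\t\t\t\t0\tJ\t52\t4081.15\t166851\n",
    "\t\t\t\t\tJ\tE\t70\t25\tB159681\n",
    "12509\t21/06/2019\t443\tDH717WF\tBLANCO\n",
    "\t\t\t\t\tB\tJ\t42\t5376.63\t5164/A\n",
]

grouped = group_lines(sample)

# Hypothetical lookup table keyed by the first header field.
by_id = {header[0]: details for header, details in grouped}
print(by_id["12509"])
```

If two headers can share the same first field, the later one would overwrite the earlier one in `by_id`, so in that case the plain list returned by `group_lines` is the safer structure.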