Python regex to parse semi-fixed width file
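I have a data file that is essentially a fixed-width text file, except that the number of spaces and the positions of the text vary. I'm trying to parse the file into lists with Python but can't figure out the appropriate regex (of course, I'm open to non-regex options too).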
Date Run By Level 1 Level 2 Level 3 Level 4 Level 5 Level 6 Level 7 Level 8 Level 9
11-15-2014 12:27:43 AM 1 ** 259.0
11-15-2014 7:47:09 AM 1 ** 98.0
11-15-2014 3:45:07 PM 1 ** 785.0
11-16-2014 12:27:43 AM 1 ** 245.0
11-16-2014 7:51:36 AM 1 ** 96.0
11-16-2014 3:43:12 PM 1 ** 788.0
11-17-2014 12:27:43 AM 1 ** 248.0
11-17-2014 7:51:21 AM 1 ** 104.0
11-17-2014 12:45:57 PM 1 ** 97.0 257.0 793.0
11-17-2014 3:46:33 PM 1 ** 792.0
11-18-2014 12:32:31 AM 1 ** 253.0
11-18-2014 7:50:31 AM 1 ** 104.0
11-18-2014 3:48:43 PM 1 ** 781.0
11-19-2014 12:30:36 AM 1 ** 260.0
11-19-2014 8:40:26 AM 1 ** 102.0
11-19-2014 3:47:45 PM 1 ** 803.0
11-20-2014 12:28:40 AM 1 ** 243.0
11-20-2014 7:53:38 AM 1 ** 107.0
11-20-2014 3:43:55 PM 1 ** 787.0
11-21-2014 1:03:45 AM 0 PS 245.0
11-21-2014 7:52:55 AM 1 ** 101.0
11-21-2014 3:44:09 PM 1 ** 789.0
11-22-2014 12:37:26 AM 1 ** 250.0
11-22-2014 7:49:55 AM 1 ** 103.0
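So far I have tried:
for line in f:
    line = re.split(r' (?=[A-Z])| (?=[0-9])| ',line)
However, the columns don't even come out aligned. I need them to line up for use downstream.
The desired output is (sorry for the limited number of rows, parsing it by hand is a killer!):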
['Date', '', 'Run', 'By', 'Level 1', 'Level 2', 'Level 3', 'Level 4', 'Level 5', 'Level 6', 'Level 7', 'Level 8', 'Level 9','\r\n']
['\r\n']
['\r\n']
['11-15-2014', '12:27:43', 'AM 1', '**', '', '259.0', '', '', '', '', '', '', '', '\r\n']
['11-15-2014', '7:47:09', 'AM 1', '**', '98.0', '', '', '', '', '', '', '', '', '\r\n']
['11-15-2014', '3:45:07', 'PM 1', '**', '', '', '785.0', '', '', '', '', '', '', '\r\n']
...
...
['11-17-2014', '12:45:57', 'PM 1', '**', '97.0', '257.0', '793.0', '', '', '', '', '', '', '\r\n']
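Essentially 13 items followed by a line break. Having the date and time combined into one field would be fine; mainly I need the date and the three levels to line up correctly. Only Level 1, Level 2 and Level 3 have values. There is usually a single level value per row, but occasionally all three are present (as shown).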
This looks like TSV format, i.e. tab-separated values. Try splitting the lines on tabs:
for line in f:
    print(line.split('\t'))
If that is the case, you can use the csv module with tab set as the delimiter.
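For reference, a minimal sketch of what that could look like. The sample text here is hypothetical (it is not the asker's file, which may contain no tabs at all), and `read_tsv` is just an illustrative name:

```python
import csv
import io

# Hypothetical tab-separated sample; the real file may not be tab-delimited.
sample = (
    "Date\tRun By\tLevel 1\tLevel 2\n"
    "11-15-2014 12:27:43 AM\t1 **\t\t259.0\n"
)

def read_tsv(handle):
    """Return every row of a tab-delimited file as a list of strings."""
    return list(csv.reader(handle, delimiter="\t"))

rows = read_tsv(io.StringIO(sample))
for row in rows:
    print(row)
```

Empty columns survive as empty strings, which is what keeps the remaining fields aligned. For a real file, pass an open file object (opened with newline='') instead of the io.StringIO wrapper.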
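Edit:
The OP confirmed this is not TSV. Here is my suggestion:
import re

# Assumption: columns are separated by runs of two or more spaces.
# The snippet as posted used `splitter` without defining it.
splitter = re.compile(r' {2,}')

headers = None
for line in input_file:
    splits = [v.strip() for v in splitter.split(line)]
    if headers:
        print(list(zip(headers, splits)))
        continue
    headers = splits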
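I wouldn't use re.split() but rather re.findall(), with this pattern:
(\d{2}-\d{2}-\d{4})\s+(\d{,2}:\d{2}:\d{2})\s(\wM \d)\s+\*\*\s{10,15}([0-9.]*)\s{10,15}([0-9.]*)\s{10,15}([0-9.]*)
I know this is dirty, but since the separator doesn't seem to be a fixed number of spaces, it might work. It will stop working if the numbers get larger, because each \s{10,15} gap needs at least 10 spaces of padding.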
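A rough sketch of how that pattern behaves. The exact spacing of the real file isn't visible in the post, so `make_line` below is a hypothetical helper that builds rows with assumed right-aligned column widths (6, 16, 15, 15); treat the results as an illustration under that assumption only:

```python
import re

# The pattern quoted above, unchanged (split over two lines for readability).
PATTERN = re.compile(
    r'(\d{2}-\d{2}-\d{4})\s+(\d{,2}:\d{2}:\d{2})\s(\wM \d)\s+\*\*'
    r'\s{10,15}([0-9.]*)\s{10,15}([0-9.]*)\s{10,15}([0-9.]*)'
)

def make_line(date, time, ampm, run, by, l1="", l2="", l3=""):
    """Hypothetical helper: build a row using assumed column widths."""
    return f"{date} {time} {ampm} {run}{by:>6}{l1:>16}{l2:>15}{l3:>15}"

lines = [
    make_line("11-15-2014", "12:27:43", "AM", 1, "**", l2="259.0"),
    make_line("11-15-2014", "7:47:09", "AM", 1, "**", l1="98.0"),
    make_line("11-17-2014", "12:45:57", "PM", 1, "**", "97.0", "257.0", "793.0"),
    make_line("11-21-2014", "1:03:45", "AM", 0, "PS", l2="245.0"),
]

# findall returns a list with one tuple of groups per match, or [] if nothing matched.
results = [PATTERN.findall(line) for line in lines]
for found in results:
    print(found)
```

Under these assumed widths the first three rows each yield one tuple with the levels in the right slots. The last row yields an empty list, because the pattern hard-codes \*\* and the sample data contains one row with PS in that column.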
I can't say how reliable this would be in production, but it works on the sample data.
Given:
txt='''\
Date Run By Level 1 Level 2 Level 3 Level 4 Level 5 Level 6 Level 7 Level 8 Level 9
11-15-2014 12:27:43 AM 1 ** 259.0
11-15-2014 7:47:09 AM 1 ** 98.0
11-15-2014 3:45:07 PM 1 ** 785.0
11-16-2014 12:27:43 AM 1 ** 245.0
11-16-2014 7:51:36 AM 1 ** 96.0
11-16-2014 3:43:12 PM 1 ** 788.0
11-17-2014 12:27:43 AM 1 ** 248.0
11-17-2014 7:51:21 AM 1 ** 104.0
11-17-2014 12:45:57 PM 1 ** 97.0 257.0 793.0
11-17-2014 3:46:33 PM 1 ** 792.0
11-18-2014 12:32:31 AM 1 ** 253.0
11-18-2014 7:50:31 AM 1 ** 104.0
11-18-2014 3:48:43 PM 1 ** 781.0
11-19-2014 12:30:36 AM 1 ** 260.0
11-19-2014 8:40:26 AM 1 ** 102.0
11-19-2014 3:47:45 PM 1 ** 803.0
11-20-2014 12:28:40 AM 1 ** 243.0
11-20-2014 7:53:38 AM 1 ** 107.0
11-20-2014 3:43:55 PM 1 ** 787.0
11-21-2014 1:03:45 AM 0 PS 245.0
11-21-2014 7:52:55 AM 1 ** 101.0
11-21-2014 3:44:09 PM 1 ** 789.0
11-22-2014 12:37:26 AM 1 ** 250.0
11-22-2014 7:49:55 AM 1 ** 103.0 '''
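Try:
import re

data = txt.splitlines()
header = data.pop(0)
for line in data:
    # Groups: date/time, AM/PM, run number, two-character "By" code, the level columns
    m = re.search(r'^([\d\-\s:]+)(AM|PM)\s+(\d)\s+(..)([\s\d\.]+)$', line)
    if m:
        row = []
        row.append(m.group(1) + m.group(2))
        row.append(m.group(3))
        row.append(m.group(4))
        # Each level column is either a blank run of 15-16 spaces or a padded number
        row.append([e.strip() for e in re.findall(r'(\s{15,16}|\s*\d+\.\d)', m.group(5))])
        print(row)
Prints: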
['11-15-2014 12:27:43 AM', '1', '**', ['', '259.0', '', '', '', '', '', '']]
['11-15-2014 7:47:09 AM', '1', '**', ['98.0', '', '', '', '', '', '', '']]
['11-15-2014 3:45:07 PM', '1', '**', ['', '', '785.0', '', '', '', '', '']]
['11-16-2014 12:27:43 AM', '1', '**', ['', '245.0', '', '', '', '', '', '']]
['11-16-2014 7:51:36 AM', '1', '**', ['96.0', '', '', '', '', '', '', '']]
['11-16-2014 3:43:12 PM', '1', '**', ['', '', '788.0', '', '', '', '', '']]
['11-17-2014 12:27:43 AM', '1', '**', ['', '248.0', '', '', '', '', '', '']]
['11-17-2014 7:51:21 AM', '1', '**', ['104.0', '', '', '', '', '', '', '']]
['11-17-2014 12:45:57 PM', '1', '**', ['97.0', '257.0', '793.0', '', '', '', '', '']]
['11-17-2014 3:46:33 PM', '1', '**', ['', '', '792.0', '', '', '', '', '']]
['11-18-2014 12:32:31 AM', '1', '**', ['', '253.0', '', '', '', '', '', '']]
['11-18-2014 7:50:31 AM', '1', '**', ['104.0', '', '', '', '', '', '', '']]
['11-18-2014 3:48:43 PM', '1', '**', ['', '', '781.0', '', '', '', '', '']]
['11-19-2014 12:30:36 AM', '1', '**', ['', '260.0', '', '', '', '', '', '']]
['11-19-2014 8:40:26 AM', '1', '**', ['102.0', '', '', '', '', '', '', '']]
['11-19-2014 3:47:45 PM', '1', '**', ['', '', '803.0', '', '', '', '', '']]
['11-20-2014 12:28:40 AM', '1', '**', ['', '243.0', '', '', '', '', '', '']]
['11-20-2014 7:53:38 AM', '1', '**', ['107.0', '', '', '', '', '', '', '']]
['11-20-2014 3:43:55 PM', '1', '**', ['', '', '787.0', '', '', '', '', '']]
['11-21-2014 1:03:45 AM', '0', 'PS', ['', '245.0', '', '', '', '', '', '']]
['11-21-2014 7:52:55 AM', '1', '**', ['101.0', '', '', '', '', '', '', '']]
['11-21-2014 3:44:09 PM', '1', '**', ['', '', '789.0', '', '', '', '', '']]
['11-22-2014 12:37:26 AM', '1', '**', ['', '250.0', '', '', '', '', '', '']]
['11-22-2014 7:49:55 AM', '1', '**', ['103.0', '']]
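Since the rows are needed downstream, here is a small sketch of turning one of the printed rows into typed values. `to_record` is a hypothetical name, and the format string assumes the month-day-year, 12-hour layout seen in the sample; %p relies on the locale using AM/PM:

```python
from datetime import datetime

def to_record(row):
    """Convert one parsed row into (datetime, run, by, [float or None, ...])."""
    stamp, run, by, levels = row
    # Collapse repeated spaces before parsing the 12-hour timestamp.
    when = datetime.strptime(" ".join(stamp.split()), "%m-%d-%Y %I:%M:%S %p")
    # Blank level columns become None so the positions stay aligned.
    values = [float(v) if v else None for v in levels]
    return when, int(run), by, values

row = ['11-17-2014 12:45:57 PM', '1', '**',
       ['97.0', '257.0', '793.0', '', '', '', '', '']]
record = to_record(row)
print(record)
```

Keeping None for blank columns, rather than dropping them, preserves the Level 1/2/3 positions that the question cares about.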
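It seems the only part with variable width is the date/time. I suggest this:
for line in f:
    # Variable-width date/time first, then fixed-width fields: 4, 6, 16, 15 and 15 characters
    m = re.match(r'(\d+-\d+-\d+ \d+:\d+:\d+) (.{4})(.{6})(.{16})(.{15})(.{15})', line)
    if m:
        print([x.strip() for x in m.groups()])
Output: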
['11-15-2014 12:27:43', 'AM 1', '**', '', '259.0', '']
['11-15-2014 7:47:09', 'AM 1', '**', '98.0', '', '']
['11-15-2014 3:45:07', 'PM 1', '**', '', '', '785.0']
['11-16-2014 12:27:43', 'AM 1', '**', '', '245.0', '']
['11-16-2014 7:51:36', 'AM 1', '**', '96.0', '', '']
['11-16-2014 3:43:12', 'PM 1', '**', '', '', '788.0']
['11-17-2014 12:27:43', 'AM 1', '**', '', '248.0', '']
['11-17-2014 7:51:21', 'AM 1', '**', '104.0', '', '']
['11-17-2014 12:45:57', 'PM 1', '**', '97.0', '257.0', '793.0']
['11-17-2014 3:46:33', 'PM 1', '**', '', '', '792.0']
['11-18-2014 12:32:31', 'AM 1', '**', '', '253.0', '']
['11-18-2014 7:50:31', 'AM 1', '**', '104.0', '', '']
['11-18-2014 3:48:43', 'PM 1', '**', '', '', '781.0']
['11-19-2014 12:30:36', 'AM 1', '**', '', '260.0', '']
['11-19-2014 8:40:26', 'AM 1', '**', '102.0', '', '']
['11-19-2014 3:47:45', 'PM 1', '**', '', '', '803.0']
['11-20-2014 12:28:40', 'AM 1', '**', '', '243.0', '']
['11-20-2014 7:53:38', 'AM 1', '**', '107.0', '', '']
['11-20-2014 3:43:55', 'PM 1', '**', '', '', '787.0']
['11-21-2014 1:03:45', 'AM 0', 'PS', '', '245.0', '']
['11-21-2014 7:52:55', 'AM 1', '**', '101.0', '', '']
['11-21-2014 3:44:09', 'PM 1', '**', '', '', '789.0']
['11-22-2014 12:37:26', 'AM 1', '**', '', '250.0', '']
['11-22-2014 7:49:55', 'AM 1', '**', '103.0', '', '']
(While it would be more typical to group AM/PM with the time, I tried to follow the description of the desired output.)
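Since the question is also open to non-regex options, the same idea can be sketched with plain slicing: locate the AM/PM marker, then cut the rest of the line by a table of widths. The widths are copied from the regex above; `make_line` and `parse_row` are hypothetical names, and the sample rows are built with assumed spacing because the real spacing isn't visible in the post:

```python
from itertools import accumulate

# Widths after the date/time, taken from the regex above:
# "AM 1" (4), By (6), Level 1 (16), Level 2 (15), Level 3 (15).
FIELD_WIDTHS = (4, 6, 16, 15, 15)

def make_line(date, time, ampm, run, by, l1="", l2="", l3=""):
    """Hypothetical helper: build a row using the assumed widths."""
    return f"{date} {time} {ampm} {run}{by:>6}{l1:>16}{l2:>15}{l3:>15}"

def parse_row(line, widths=FIELD_WIDTHS):
    """Split one data row; return None for lines without an AM/PM marker."""
    for marker in (" AM ", " PM "):
        pos = line.find(marker)
        if pos != -1:
            break
    else:
        return None  # header or blank line
    stamp = " ".join(line[:pos].split())  # normalise spacing in the date/time
    rest = line[pos + 1:]                 # starts at "AM" or "PM"
    ends = list(accumulate(widths))
    starts = [0] + ends[:-1]
    return [stamp] + [rest[s:e].strip() for s, e in zip(starts, ends)]

parsed = [
    parse_row(make_line("11-15-2014", "3:45:07", "PM", 1, "**", l3="785.0")),
    parse_row(make_line("11-21-2014", "1:03:45", "AM", 0, "PS", l2="245.0")),
    parse_row("Date Run By Level 1 Level 2 Level 3"),
]
for row in parsed:
    print(row)
```

Extending this to Level 4 through Level 9 would only mean appending more widths to FIELD_WIDTHS, assuming those columns are fixed width as well.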