The issue lies with a variable definition, and I am unsure how to resolve it
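I am trying to use a regular expression to extract dates from a text file.
Example of a date line in the text file:
    1530Z 1 FEB 1990
The regular expression I am using:
    date_matcher = re.compile("^([0-9]{4}[z].[0-9]+.[A-Z]{3}.[0-9]{4})")
I am trying to modify the code I am working with so that it pulls the date and time matched by the regular expression. Here is the code: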
    # get just the data lines, without headers.
    def get_data_lines( path ):
        # where we are putting data lines (no header lines)
        data_lines = []
        #for root, dirs, files in os.walk(path):
            #print oot, dirs, dirs2, files
        if os.path.isfile(str(path)) and (str(path.endswith('.dat'))):
            with open(path) as f:
                dt = None
                for line in f:
                    # check that line isn't empty
                    if line.strip():
                        # the compiled matcher will return a match object
                        # or null if no match was found.
                        result = data_matcher.match(line)
                        if result:
                            data_lines.append((line,dt))
                        else:
                            dtres = date_matcher.match(line)
                            if dtres:
                                line = [ w for w in line.split() if w]
                                date = line[-4:]
                                if len(date) == 4:
                                    time, day, month, year = date
                                    # print date
                                    # fix the date bits
                                    time = time.replace('Z','')
                                    day = int(day)
                                    month = strptime(month,'%b').tm_mon
                                    year = int(year)
                                    hour, minutes = re.findall('..',time)
                                    dt = datetime(year,month,day,int(hour),int(minutes))
        return data_lines
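dt = datetime(year,month,day,int(hour),int(minutes)) is all on one line, but it did not look that way once I formatted the post, so I thought it would help to point that out.
I know the problem lies with dt = None. When I have it print all the dates in the directory of files I am pulling from, it prints only None, as many times as there are dates.
The expected result is for the dt variable to start out empty and be replaced with the date whenever a date line is encountered.
So for this example, what I want is: 1530 1 2 1990
for the line: 1530Z 1 FEB 1990
and to be able to access the month, day, year, and time from the object I assign it to.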
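Here is my solution, which changes the regular expression pattern. The original pattern looks for a lowercase z, which never matches the uppercase Z in the file, so dt is never reassigned and stays None. I replaced it with date_matcher = re.compile(r"((\d{4})[Z]).*(\d{1,2}).(\w{3}).(\d{4})"), which should give you the result you want.
From there, I use re.sub simply to make the date look the way you want (that is, more readable than the original). It removes the Z character, changes the month name to the corresponding month number, and removes the extra whitespace in the middle of the string.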
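A minimal sketch of those steps on the sample line alone, so they can be checked outside the file loop. The names sample, old_matcher, new_matcher and readable are only for illustration, and the %b month lookup assumes an English locale. Here the Z is dropped with a backreference so the four time digits are kept.

```python
import re
from time import strptime

sample = "1530Z 1 FEB 1990"

# The original pattern expects a lowercase "z", so it does not match the sample.
old_matcher = re.compile("^([0-9]{4}[z].[0-9]+.[A-Z]{3}.[0-9]{4})")
new_matcher = re.compile(r"((\d{4})[Z]).*(\d{1,2}).(\w{3}).(\d{4})")

old_result = old_matcher.match(sample)  # None: "z" is not "Z"
new_result = new_matcher.match(sample)  # match object

# Group 4 of the new pattern holds the month abbreviation.
month_name = new_result.group(4)
month_number = str(strptime(month_name, "%b").tm_mon)

readable = re.sub(r"(\d{4})Z", r"\1", sample)                  # drop the Z, keep the digits
readable = re.sub(r"\b[A-Za-z]{3}\b", month_number, readable)  # month name -> number
readable = re.sub(r"\s+", " ", readable).strip()               # collapse whitespace

print(old_result)  # None
print(readable)    # 1530 1 2 1990
```

The complete function follows.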
    import os
    import re
    from time import strptime

    data_matcher = re.compile(r'^(\s\s[0-2])')
    date_matcher = re.compile(r"((\d{4})[Z]).*(\d{1,2}).(\w{3}).(\d{4})")

    def get_data_lines( path ):
        # where we are putting data lines (no header lines)
        data_lines = []
        #for root, dirs, files in os.walk(path):
            #print oot, dirs, dirs2, files
        # test the suffix on str(path); str(path.endswith(...)) is always truthy
        if os.path.isfile(str(path)) and str(path).endswith('.dat'):
            with open(path) as f:
                dt = None
                for line in f:
                    # check that line isn't empty
                    if line.strip():
                        # the compiled matcher will return a match object
                        # or None if no match was found.
                        result = data_matcher.match(line)
                        if result:
                            # data line: pair it with the most recent date seen
                            data_lines.append((line,dt))
                        else:
                            dtres = date_matcher.match(line)
                            if dtres:
                                # date line: build the readable form, e.g. "1530 1 2 1990"
                                month = dtres.group(4)
                                dt = re.sub(r'(\d{4})Z', r'\1', line) #Remove Z character
                                dt = re.sub(r'\b[A-Za-z]{3}\b', str(strptime(month,'%b').tm_mon), dt) #Change month name to number
                                dt = re.sub(r'\s+', ' ', dt).strip() #Remove extra whitespace
        return data_lines
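If you would rather have an object whose month, day, year and time can be read directly, one option is to turn the date line into a datetime instead of a string. This is only a sketch under assumptions: DATE_LINE and parse_date_line are hypothetical names that are not part of the code above, the line is assumed to contain nothing but the date, and %b assumes an English locale.

```python
import re
from datetime import datetime

# Hypothetical pattern with named groups; not taken from the original post.
DATE_LINE = re.compile(
    r"^(?P<hour>\d{2})(?P<minute>\d{2})Z\s+(?P<day>\d{1,2})\s+"
    r"(?P<month>[A-Za-z]{3})\s+(?P<year>\d{4})\s*$"
)

def parse_date_line(line):
    """Return a datetime for a line like '1530Z 1 FEB 1990', else None."""
    match = DATE_LINE.match(line)
    if match is None:
        return None
    # Rebuild a string that datetime.strptime can parse in one call.
    stamp = "{day} {month} {year} {hour}{minute}".format(**match.groupdict())
    return datetime.strptime(stamp, "%d %b %Y %H%M")

dt = parse_date_line("1530Z 1 FEB 1990\n")
print(dt.hour, dt.minute, dt.day, dt.month, dt.year)  # 15 30 1 2 1990
```

Inside get_data_lines, dt = parse_date_line(line) could take the place of the re.sub calls, with the result kept only when it is not None.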