Issue lies with variable definition. I am unsure how to resolve

Question

I am trying to use a regular expression to extract dates from a text file. An example of a date line in the text file:

1530Z   1 FEB 1990

Using the regular expression:

date_matcher = re.compile("^([0-9]{4}[z].[0-9]+.[A-Z]{3}.[0-9]{4})")

I am trying to modify the code I am using so that it then "pulls" the date and time matched by the regular expression. Here is the code:

# get just the data lines, without headers.
def get_data_lines( path ):

     # where we are putting data lines (no header lines)
     data_lines = []

     #for root, dirs,  files in os.walk(path):
         #print oot, dirs, dirs2, files
     if os.path.isfile(str(path)) and (str(path.endswith('.dat'))):
         with open(path) as f:
             dt = None
             for line in f:

                 # check that line isn't empty
                 if line.strip():

                     # the compiled matcher will return a match object
                     # or null if no match was found.
                     result = data_matcher.match(line)
                     if result:
                         data_lines.append((line,dt))
                     else:
                         dtres = date_matcher.match(line)
                         if dtres:
                             line = [ w for w in line.split() if w]
                             date = line[-4:]
                             if len(date) == 4:
                                 time, day, month, year = date
                                # print date
                                 # fix the date bits
                                 time  = time.replace('Z','')
                                 day   = int(day)
                                 month = strptime(month,'%b').tm_mon
                                 year  = int(year)

                                 hour, minutes = re.findall('..',time)
                                 dt = datetime(year,month,day,int(hour),int(minutes))

     return data_lines

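dt = datetime(year,month,day,int(hour),int(minutes)) is all on one line, but it does not look that way once I format it here, so I thought it would help to point that out.

I know the problem lies with dt = None. When I have it print all the dates from the directory of files I am pulling from, it just prints None as many times as I have dates.

The expected result is for the dt variable to be created empty and then replaced with the date whenever a date line is encountered. So for this example, what I want is: 1530 1 2 1990
for the line: 1530Z 1 FEB 1990, and to be able to call the month, day, year, and time from the object I assign it to.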

Answer 1

Here is my solution, which changes the regex pattern. I replaced it with date_matcher = re.compile(r"((\d{4})[Z]).*(\d{1,2}).(\w{3}).(\d{4})"), which should give you the result you want.
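
As a quick check of why the pattern matters, the sketch below runs both patterns against the sample line from the question. The variable names `original` and `revised` are only for illustration. Two things appear to stop the original pattern from matching: the character class `[z]` is case-sensitive, so it cannot match the uppercase Z, and a single `.` covers only one separator character, while the sample line has three spaces after the Z.

```python
import re

sample = "1530Z   1 FEB 1990"

# Pattern from the question: lowercase [z] cannot match the uppercase "Z".
original = re.compile("^([0-9]{4}[z].[0-9]+.[A-Z]{3}.[0-9]{4})")
# Pattern from this answer: uppercase [Z] plus a capture group per field.
revised = re.compile(r"((\d{4})[Z]).*(\d{1,2}).(\w{3}).(\d{4})")

original_match = original.match(sample)
revised_match = revised.match(sample)

print(original_match)            # None, so dt is never assigned
print(revised_match.group(2))    # the four time digits
print(revised_match.group(4))    # the month abbreviation
print(revised_match.group(5))    # the year
```

Because the original pattern returns None for every date line, the `if dtres:` branch in the question never runs and `dt` keeps its initial value of None.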

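From there, I use re.sub to simply make the date look the way you want (i.e. more readable than the original). It removes the Z character, changes the month name to the corresponding month number, and removes the extra whitespace in the middle of the string.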
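
The three substitutions can be tried in isolation on the sample line. This is a minimal sketch rather than the exact code of the answer: the names `line`, `text`, `month_name` and `month_number` are illustrative, and the month lookup assumes an English locale for `%b`.

```python
import re
from time import strptime

line = "1530Z   1 FEB 1990\n"

# Drop the trailing Z but keep the four time digits.
text = re.sub(r"(\d{4})Z", r"\1", line)

# Look up the month number from the three-letter abbreviation.
month_name = re.search(r"\b([A-Za-z]{3})\b", text).group(1)
month_number = strptime(month_name, "%b").tm_mon
text = re.sub(r"\b[A-Za-z]{3}\b", str(month_number), text)

# Collapse runs of whitespace and trim the trailing newline.
text = re.sub(r"\s+", " ", text).strip()

print(text)  # 1530 1 2 1990
```

Note that the first substitution uses the backreference `\1` as the replacement; replacing with an empty string would delete the time digits along with the Z.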

import os
import re
from time import strptime
from datetime import datetime

# data lines start with two whitespace characters followed by 0, 1 or 2
data_matcher = re.compile(r'^(\s\s[0-2])')
# date lines look like "1530Z   1 FEB 1990" (note the uppercase Z)
date_matcher = re.compile(r"((\d{4})[Z]).*(\d{1,2}).(\w{3}).(\d{4})")

def get_data_lines( path ):

    # where we are putting data lines (no header lines)
    data_lines = []

    # call endswith() on the string itself; str(path.endswith('.dat')) is
    # the non-empty string 'True' or 'False', so it is always truthy
    if os.path.isfile(str(path)) and str(path).endswith('.dat'):
        with open(path) as f:
            dt = None
            for line in f:

                # skip empty lines
                if not line.strip():
                    continue

                # the compiled matcher will return a match object
                # or None if no match was found.
                dtres = date_matcher.match(line)

                if dtres:
                    month = dtres.group(4)
                    text = re.sub(r'((\d{4})[Z])', r'\2', line) #Remove Z character
                    text = re.sub(r'\b(\w{3})\b', str(strptime(month,'%b').tm_mon), text) #Change month name to number
                    text = re.sub(r'\s+', ' ', text).strip() #Remove extra whitespace

                    # text now looks like "1530 1 2 1990"
                    date = text.split()[-4:]
                    if len(date) == 4:
                        time, day, month, year = date
                        hour, minutes = re.findall('..',time)
                        # remember this date for the data lines that follow
                        dt = datetime(int(year),int(month),int(day),int(hour),int(minutes))

                elif data_matcher.match(line):
                    # data line: pair it with the most recent date
                    data_lines.append((line,dt))

    return data_lines
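
To try the "carry the last date forward" idea without a .dat file, here is a small self-contained sketch that works on a list of lines in memory. Everything in it is an assumption made for illustration: the function name `pair_lines_with_dates`, the stricter `DATE_LINE` pattern (which uses `\s+` between fields and is not the pattern from the answer), and the sample header and data lines, which are made up.

```python
import re
from datetime import datetime

# Hypothetical, stricter date pattern: hour, minute, day, month, year.
DATE_LINE = re.compile(r"^(\d{2})(\d{2})Z\s+(\d{1,2})\s+([A-Za-z]{3})\s+(\d{4})\s*$")
# Same idea as data_matcher: two whitespace characters, then 0, 1 or 2.
DATA_LINE = re.compile(r"^\s\s[0-2]")

def pair_lines_with_dates(lines):
    """Return (data_line, datetime) pairs, tagging each data line with
    the most recent date line seen before it."""
    pairs = []
    current = None  # stays None until the first date line is found
    for line in lines:
        found = DATE_LINE.match(line)
        if found:
            hour, minute, day, month, year = found.groups()
            current = datetime.strptime(
                f"{year} {month} {day} {hour} {minute}", "%Y %b %d %H %M"
            )
        elif DATA_LINE.match(line):
            pairs.append((line.rstrip("\n"), current))
    return pairs

# Made-up sample content; real files will differ.
sample_lines = [
    "SOME HEADER\n",
    "1530Z   1 FEB 1990\n",
    "  1 10.5 20.1\n",
    "  2 11.0 19.8\n",
    "1600Z  12 MAR 1990\n",
    "  0 9.7 21.3\n",
]

pairs = pair_lines_with_dates(sample_lines)
first_line, first_dt = pairs[0]
print(first_dt.hour, first_dt.minute, first_dt.day, first_dt.month, first_dt.year)
```

Because each data line is stored with a datetime object, the month, day, year and time asked for in the question are available as attributes such as first_dt.month and first_dt.hour.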


python

regex

python-3.6