Extracting dates that are in different formats using regex and sorting them - pandas
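I am new to text mining, and I need to extract dates from a *.txt file and sort them. The dates appear inside sentences (one per line), and they may be written in any of the following formats: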
04/20/2009; 04/20/09; 4/20/09; 4/3/09
Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;
20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
Feb 2009; Sep 2009; Oct 2010
6/2008; 12/2009
2009; 2010
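If the day is missing, assume the 1st; if the month is missing, assume January.
My idea is to extract all the dates and convert them to mm/dd/yyyy format. However, I am unsure how to find and replace the patterns. This is what I have done so far: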
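import re
import pandas as pd

# Read the file line by line; each line is one sentence containing a date.
doc = []
with open('dates.txt') as file:
    for line in file:
        doc.append(line)

df = pd.Series(doc)
df2 = pd.DataFrame(df, columns=['text'])

def myfunc(x):
    # Normalize a numeric date string to mm/dd/yyyy.
    if len(x) == 4 and x.isdigit():
        # Year only: assume January 1st.
        x = '01/01/' + x
    else:
        # Treat '-' like '/' so both separators are handled the same way.
        x = re.sub('-', '/', x)
        terms = re.split('/', x)
        # Two-digit years are assumed to be in the 1900s.
        year = '19' + terms[-1] if len(terms[-1]) == 2 else terms[-1]
        if len(terms) == 2:
            # Month and year only: assume the 1st of the month.
            x = terms[0].zfill(2) + '/01/' + year
        else:
            x = terms[0].zfill(2) + '/' + terms[1].zfill(2) + '/' + year
    return x

# A callable replacement requires regex=True in str.replace.
df2['text'] = df2.text.str.replace(r'(((?:\d+[/-])?\d+[/-]\d+)|\d{4})', lambda m: myfunc(m.group(0)), regex=True)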
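I have only handled the numeric date formats. I am a bit confused about how to handle the alphanumeric dates.
I know this is rough code, but it is what I have so far.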
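I think this is one of the Coursera text mining assignments. You can get the solution with regular expressions and str.extract. With dates.txt loaded as follows: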
import pandas as pd

doc = []
with open('dates.txt') as file:
    for line in file:
        doc.append(line)

df = pd.Series(doc)

def date_sorter():
    # expand=False makes str.extract return a Series rather than a one-column DataFrame.
    # Get the dates written with a month name
    one = df.str.extract(r'((?:\d{,2}\s)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*(?:-|\.|\s|,)\s?\d{,2}[a-z]*(?:-|,|\s)?\s?\d{2,4})', expand=False)
    # Get the fully numeric dates (month, day and year)
    two = df.str.extract(r'((?:\d{1,2})(?:(?:\/|-)\d{1,2})(?:(?:\/|-)\d{2,4}))', expand=False)
    # Get the dates with no day, i.e. only month and year, or only a year
    three = df.str.extract(r'((?:\d{1,2}(?:-|\/))?\d{4})', expand=False)
    # Fill the NaNs in one from two, then from three, fix the month names that are
    # misspelled in the text file, and convert the result to datetime.
    # On pandas >= 2.0, format='mixed' may need to be passed to pd.to_datetime,
    # because the extracted strings use several different formats.
    dates = pd.to_datetime(one.fillna(two).fillna(three).replace('Decemeber', 'December', regex=True).replace('Janaury', 'January', regex=True))
    return pd.Series(dates.sort_values())

date_sorter()
Output:
9 1971-04-10
84 1971-05-18
2 1971-07-08
53 1971-07-11
28 1971-09-12
474 1972-01-01
153 1972-01-13
13 1972-01-26
129 1972-05-06
98 1972-05-13
111 1972-06-10
225 1972-06-15
31 1972-07-20
171 1972-10-04
191 1972-11-30
486 1973-01-01
335 1973-02-01
415 1973-02-01
36 1973-02-14
405 1973-03-01
323 1973-03-01
422 1973-04-01
375 1973-06-01
380 1973-07-01
345 1973-10-01
57 1973-12-01
481 1974-01-01
436 1974-02-01
104 1974-02-24
299 1974-03-01
If you only want to return the index, use return pd.Series(dates.sort_values().index) instead.
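The fallback order that the fillna chain relies on (month-name pattern first, then fully numeric dates, then month/year or a bare year) can be sketched without pandas. This is only an illustration: the helper name first_date, the sample sentences, and the .strip() call are assumptions that are not part of the answer.

```python
import re

# The three patterns from the answer, in the order the fillna chain tries them.
PATTERNS = [
    # 1. Dates written with a month name, e.g. "Mar 20th, 2009"
    r'((?:\d{,2}\s)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*'
    r'(?:-|\.|\s|,)\s?\d{,2}[a-z]*(?:-|,|\s)?\s?\d{2,4})',
    # 2. Fully numeric dates, e.g. "04/20/2009"
    r'((?:\d{1,2})(?:(?:\/|-)\d{1,2})(?:(?:\/|-)\d{2,4}))',
    # 3. Month/year or a bare year, e.g. "6/2008" or "2009"
    r'((?:\d{1,2}(?:-|\/))?\d{4})',
]

def first_date(line):
    """Return the text matched by the first pattern that succeeds, or None.

    Hypothetical helper that mimics one.fillna(two).fillna(three) for one line.
    """
    for pattern in PATTERNS:
        m = re.search(pattern, line)
        if m:
            # The first pattern can pick up a leading space, so strip it.
            return m.group(1).strip()
    return None

# Made-up sentences, one date (or none) per line.
lines = [
    "Seen on 04/20/2009 for follow-up",
    "Admitted Mar 20th, 2009 with pain",
    "Since 6/2008 no issues",
    "Diagnosed in 2010",
    "No date here",
]
results = [first_date(line) for line in lines]
print(results)
```

The order matters: the third pattern also matches the year inside a full date such as 04/20/2009, so it is only safe as the last resort.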
Breakdown of the first regular expression
# (?: ... ) is a non-capturing group
((?:\d{,2}\s)? # Optional leading day: up to two digits followed by a space. The `?` applies to the preceding group, which may occur once or not at all.
(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* # One of the month abbreviations followed by any number (`*`) of lowercase letters (`[a-z]`), so both Mar and March match.
(?:-|\.|\s|,) # Exactly one separator: `-`, `.`, whitespace, or `,`
\s? # An optional space (here `?` applies only to the preceding token, `\s`)
\d{,2}[a-z]* # Up to two digits followed by any number of letters (`*`), e.g. 1st, 13th, 22nd
(?:-|,|\s)? # An optional separator: `-`, `,`, or whitespace; the `?` at the end makes it occur once or not at all
\s? # An optional space (at most one; `?` applies only to `\s`)
\d{2,4}) # Two to four digits (the year)
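As a quick check of the breakdown above, the same pattern can be compiled with re.VERBOSE, which lets the comments live inside the pattern itself, and then run against the month-name formats listed in the question. This is a minimal sketch using only the standard library; the name MONTH_NAME_DATE and the use of fullmatch on isolated samples are assumptions made for illustration.

```python
import re

# The first pattern from the answer, laid out with re.VERBOSE so that each
# part carries its own comment. Whitespace inside the pattern is ignored.
MONTH_NAME_DATE = re.compile(r"""
    (
      (?:\d{,2}\s)?      # optional leading day followed by a space
      (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*  # month name
      (?:-|\.|\s|,)      # one separator
      \s?                # optional space
      \d{,2}[a-z]*       # up to two digits plus an optional suffix such as 'th'
      (?:-|,|\s)?        # optional separator
      \s?                # optional space
      \d{2,4}            # two to four digits
    )
""", re.VERBOSE)

# Month-name formats taken from the question.
samples = [
    "Mar-20-2009", "Mar 20, 2009", "March 20, 2009", "Mar. 20, 2009",
    "Mar 20 2009", "20 Mar 2009", "20 March 2009", "20 Mar. 2009",
    "20 March, 2009", "Mar 20th, 2009", "Mar 21st, 2009", "Mar 22nd, 2009",
    "Feb 2009", "Sep 2009", "Oct 2010",
]
matched = [s for s in samples if MONTH_NAME_DATE.fullmatch(s)]

# Purely numeric formats are left for the other two patterns.
numeric_only = ["04/20/2009", "6/2008", "2009"]
unmatched = [s for s in numeric_only if MONTH_NAME_DATE.fullmatch(s) is None]

print(len(matched), "of", len(samples), "month-name samples matched")
print("not matched:", unmatched)
```

Note that for a string such as 20 Mar 2009 the year is consumed in two pieces: `\d{,2}` takes 20 and `\d{2,4}` takes 09.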
Hope this helps.