Find all instances in text where the last word of one match should also be the beginning of the next match (Python regex)

I can't find a solution to a regex problem I've run into. This is actually a follow-up question to this post:

I created the following sample text (in my application the text is very long and spread over multiple files, etc.):

Course 22/09/2010 1. Early duty Josephine, Jansen 22-09-2010 10:37:08 Date 22/09/2010 Duty 1. Early duty 1.3 Here there can be some other related stuff Nursegoals Interventions Record This is now the fourth note. 6.2.1.3 Confusion: Observing. Nursegoals Interventions Record This is a new, note (again), i call it note 3. Course 22/09/2010 1. Early duty Record This is again a note, i call it note 2. Apple: 0/less Course 22/09/2010 3. Nightduty Josephine, Jansen 22-09-2010 06:22:25 Date 22/09/2010 Course 3. Nightduty 1.3 Something else here Nursegoals Interventions Record 6.2.1.3 Confusion: Observing. Nursegoals Interventions Record Course 22/09/2010 3. Nightduty Record This is a new note, i call it note 1.

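Now I want to parse specific information out of this text. I am interested in 'Record', that is, the piece of text that follows the word Record, together with the date of that specific record. By date I mean a date such as 02-11-2010 plus the notion of early, late or night duty (so the date would be: '02-09-2010 1. Early duty'). The problem is that the files have no real consistency: sometimes there are two notes for one date and other times only one, and sometimes the note section contains text and sometimes it does not.

I know how to parse the Record part, but I don't know how to parse the date first and the notes afterwards, so I want to split the problem in two. Step one: split the whole file into separate date sections. Step two: loop over all date sections and get the notes for that specific date section (with a regex). Then I would build a list containing the specific date (if I only want the date itself, for example to put it in a column cell, I will simply take the first 13 characters of that date section) and the note(s) related to that date. For example:

list = [02-08-2010 1. Early duty, [note1, note2], 02-08-2010 2. Late duty, [note1], etc]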

Let's focus only on the date parsing so that my question is clear. I use the following code:

import re

# text holds the sample text shown above
date = r'Course\s+(.*?)(?:Course|$)'
date_list = re.findall(date, text, re.DOTALL)
for i in date_list:
    print(i)
    print('XXX')

The output is:

22/09/2010 1. Early duty Josephine, Jansen 22-09-2010 10:37:08 Date 22/09/2010 Duty 1. Early duty 1.3 Here there can be some other related stuff Nursegoals Interventions Record This is now the fourth note. 6.2.1.3 Confusion: Observing. Nursegoals Interventions Record This is a new, note (again), i call it note 3. 
XXX
22/09/2010 3. Nightduty Josephine, Jansen 22-09-2010 06:22:25 Date 22/09/2010 
XXX
22/09/2010 3. Nightduty Record This is a new note, i call it note 1.
XXX

This output is missing the following elements:

['Course 22/09/2010 1. Early duty Record This is again a note, i call it note 2. Apple: 0/less']

['3. Nightduty 1.3 Something else here Nursegoals Interventions Record 6.2.1.3 Confusion: Observing. Nursegoals Interventions']

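So it skips every other section. I think this is because the regex does not treat the word 'Course' that ends one match as also being the start of a new match.

It would be great if someone could help me :) I'm probably missing something.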

Change the non-capturing group into a positive lookahead:

r'Course\s+(.*?)(?=Course|$)'
                 ^^

See the regex demo. An unrolled, faster variant is r'Course\s+([^C]*(?:C(?!ourse)[^C]*)*)' (see demo). Because it contains no dot, this variant does not need re.DOTALL.

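Otherwise, overlapping substrings are not matched: the non-capturing group (?:Course|$) consumes the trailing Course, so the next search starts after that word and cannot use it to begin a new match. A lookahead only checks that Course follows, without consuming it.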
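To make the difference concrete, here is a minimal sketch that runs the question's consuming pattern, the lookahead pattern, and the unrolled variant on a tiny input. The short string and the variable names are made up for illustration and are not part of the original data.

```python
import re

# Hypothetical miniature input: three sections, each introduced by "Course".
text = "Course early Course Confusion Course night"

consuming = r"Course\s+(.*?)(?:Course|$)"           # pattern from the question
lookahead = r"Course\s+(.*?)(?=Course|$)"           # lookahead fix
unrolled = r"Course\s+([^C]*(?:C(?!ourse)[^C]*)*)"  # unrolled variant

# The consuming group swallows the second "Course", so the middle section is lost.
consumed = re.findall(consuming, text, re.DOTALL)

# The lookahead leaves "Course" in place, so every section is found.
peeked = re.findall(lookahead, text, re.DOTALL)

# The unrolled pattern has no dot, so re.DOTALL is not needed.
fast = re.findall(unrolled, text)

print(consumed)  # ['early ', 'night']
print(peeked)    # ['early ', 'Confusion ', 'night']
print(fast)      # ['early ', 'Confusion ', 'night']
```

Note that the middle section starts with a capital C: the unrolled pattern still captures it because C(?!ourse) only refuses a C that begins the word Course.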

Python demo:

import re
rx = r"Course\s+(.*?)(?=Course|$)"
s = "Course 22/09/2010 1. Early duty Josephine, Jansen 22-09-2010 10:37:08 Date 22/09/2010 Duty 1. Early duty 1.3 Here there can be some other related stuff Nursegoals Interventions Record This is now the fourth note. 6.2.1.3 Confusion: Observing. Nursegoals Interventions Record This is a new, note (again), i call it note 3. Course 22/09/2010 1. Early duty Record This is again a note, i call it note 2. Apple: 0/less Course 22/09/2010 3. Nightduty Josephine, Jansen 22-09-2010 06:22:25 Date 22/09/2010 Course 3. Nightduty 1.3 Something else here Nursegoals Interventions Record 6.2.1.3 Confusion: Observing. Nursegoals Interventions Record Course 22/09/2010 3. Nightduty Record This is a new note, i call it note 1."
results = re.findall(rx, s, re.DOTALL)
for x in results:
    print(x)

Output:

22/09/2010 1. Early duty Josephine, Jansen 22-09-2010 10:37:08 Date 22/09/2010 Duty 1. Early duty 1.3 Here there can be some other related stuff Nursegoals Interventions Record This is now the fourth note. 6.2.1.3 Confusion: Observing. Nursegoals Interventions Record This is a new, note (again), i call it note 3. 
22/09/2010 1. Early duty Record This is again a note, i call it note 2. Apple: 0/less 
22/09/2010 3. Nightduty Josephine, Jansen 22-09-2010 06:22:25 Date 22/09/2010 
3. Nightduty 1.3 Something else here Nursegoals Interventions Record 6.2.1.3 Confusion: Observing. Nursegoals Interventions Record 
22/09/2010 3. Nightduty Record This is a new note, i call it note 1.
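Once the sections are split this way, the second step described in the question (collecting the notes of each date section) might look roughly like the sketch below. It rests on assumptions that the question does not confirm: a note is taken to be everything between one Record and the next Record (or the end of the section), the text before the first Record is treated as the section header, and the word Record never occurs inside a note. The function name parse_sections and the shortened sample text are hypothetical.

```python
import re

# Split on "Course" without consuming it (the lookahead from the answer).
SECTION_RX = re.compile(r"Course\s+(.*?)(?=Course|$)", re.DOTALL)

# Assumption: a note runs from "Record" to the next "Record" or the section end.
NOTE_RX = re.compile(r"Record\s*(.*?)(?=Record|$)", re.DOTALL)


def parse_sections(text):
    """Return a list of (header, [notes]) tuples, one per "Course" section."""
    parsed = []
    for section in SECTION_RX.findall(text):
        # Assumption: everything before the first "Record" is the header.
        header, _, _ = section.partition("Record")
        notes = [note.strip() for note in NOTE_RX.findall(section)]
        # Drop empty notes ("Record" followed by no text).
        parsed.append((header.strip(), [note for note in notes if note]))
    return parsed


# Hypothetical shortened sample in the same shape as the question's text.
sample = (
    "Course 22/09/2010 1. Early duty Record first note. Record second note. "
    "Course 22/09/2010 3. Nightduty Record "
    "Course 22/09/2010 3. Nightduty Record third note."
)

result = parse_sections(sample)
for header, notes in result:
    print(header, notes)
```

On the real data the note boundary would likely need tightening, since headings such as 6.2.1.3 Confusion: Observing. follow some notes and a section like the fourth one in the output above does not start with a date.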