Find all instances in text where the last word of a match is also the beginning of the next match, with regex in Python
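I can't find a solution to a regex problem I'm having. This is actually a follow-up question to this post:
I created the following sample text (in my application the text is long and spread over multiple files, etc.):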
Course 22/09/2010 1. Early duty Josephine, Jansen 22-09-2010 10:37:08
Date 22/09/2010 Duty 1. Early duty 1.3 Here there can be some other
related stuff Nursegoals Interventions Record This is now the fourth
note. 6.2.1.3 Confusion: Observing. Nursegoals Interventions Record
This is a new, note (again), i call it note 3. Course 22/09/2010 1.
Early duty Record This is again a note, i call it note 2. Apple:
0/less Course 22/09/2010 3. Nightduty Josephine, Jansen 22-09-2010
06:22:25 Date 22/09/2010 Course 3. Nightduty 1.3 Something else here
Nursegoals Interventions Record 6.2.1.3 Confusion: Observing.
Nursegoals Interventions Record Course 22/09/2010 3. Nightduty Record
This is a new note, i call it note 1.
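Now I want to parse specific information out of this text. I am interested in 'Record', that is, the part of the text that follows Record, together with the date of that particular record. By date I mean a date such as 02-11-2010 plus the notion of early, late or night duty (so a date would be '02-09-2010 1. Early duty'). The problem is that the files are not really consistent: sometimes there are two notes for one date, other times only one. Sometimes the note part contains text and sometimes it doesn't.
I know how to parse the Record part, but I don't know how to parse the date first and the notes afterwards. So I want to split the problem in two. Step one: split the whole file into separate date parts. Step two: loop over all date parts and extract the notes for that particular date part (with a regex). Then I would build a list containing each date (if I only want the date itself, for example to put it in a column cell, I will simply take the first 13 characters of the date part) and the note(s) that belong to it. For example:
list = [02-08-2010 1. Early duty, [note1, note2], 02-08-2010 2. Late duty, [note1], etc]
Let's focus on the date parsing only, so that my question is clear. I use the following code: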
date = r'Course\s+(.*?)(?:Course|$)'
date_list = re.findall(date, text, re.DOTALL)
for i in date_list:
    print (i)
    print ('XXX')
The output is:
22/09/2010 1. Early duty Josephine, Jansen 22-09-2010 10:37:08
Date22/09/2010 Duty 1. Early duty 1.3 Here there can be some other
related stuff Nursegoals Interventions Record This is now the fourth
note. 6.2.1.3 Confusion: Observing. Nursegoals Interventions Record
This is a new, note (again), i call it note 3. XXX 22/09/2010 3.
Nightduty Josephine, Jansen 22-09-2010 06:22:25 Date 22/09/2010 XXX
22/09/2010 3. Nightduty Record This is a new note, i call it note 1.
XXX
This output is missing the following elements:
['Course 22/09/2010 1. Early duty Record This is again a note, i call
it note 2. Apple: 0/less']
and
['3. Nightduty 1.3 Something else here Nursegoals Interventions Record
6.2.1.3 Confusion: Observing. Nursegoals Interventions']
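So it skips some parts, I think because the regex does not treat the word 'Course' that ends one match as the possible start of a new match.
It would be great if someone could help me :) I am probably missing something.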
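Change the non-capturing group to a positive lookahead:
r'Course\s+(.*?)(?=Course|$)'
                 ^^
Otherwise, overlapping substrings won't be matched: the non-capturing group (?:Course|$) consumes the trailing Course, so the search for the next match resumes after it and that section is skipped. The lookahead only checks that Course (or the end of the string) follows, without consuming it.
See the regex demo. An unrolled, faster variant is r'Course\s+([^C]*(?:C(?!ourse)[^C]*)*)' (see demo).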
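To make the difference visible, here is a minimal sketch on a shortened, made-up string (the text "Course A Course B Course C" is an assumption for illustration, not taken from the question). It runs the original pattern, the lookahead pattern and the unrolled variant side by side:

```python
import re

# Hypothetical toy input: three sections, each starting with "Course".
text = "Course A Course B Course C"

# Non-capturing group: the trailing "Course" is consumed by the match,
# so the next search starts after it and section B is lost.
consuming = re.findall(r"Course\s+(.*?)(?:Course|$)", text, re.DOTALL)

# Lookahead: the trailing "Course" is only checked, not consumed,
# so it is still available as the start of the next match.
lookahead = re.findall(r"Course\s+(.*?)(?=Course|$)", text, re.DOTALL)

# Unrolled variant: no lazy quantifier; a "C" is accepted only when it
# is not the start of "Course". No re.DOTALL needed, [^C] matches newlines.
unrolled = re.findall(r"Course\s+([^C]*(?:C(?!ourse)[^C]*)*)", text)

print(consuming)  # ['A ', 'C']
print(lookahead)  # ['A ', 'B ', 'C']
print(unrolled)   # ['A ', 'B ', 'C']
```

The middle section only shows up once the delimiter is left unconsumed.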
import re
rx = r"Course\s+(.*?)(?=Course|$)"
s = "Course 22/09/2010 1. Early duty Josephine, Jansen 22-09-2010 10:37:08 Date 22/09/2010 Duty 1. Early duty 1.3 Here there can be some other related stuff Nursegoals Interventions Record This is now the fourth note. 6.2.1.3 Confusion: Observing. Nursegoals Interventions Record This is a new, note (again), i call it note 3. Course 22/09/2010 1. Early duty Record This is again a note, i call it note 2. Apple: 0/less Course 22/09/2010 3. Nightduty Josephine, Jansen 22-09-2010 06:22:25 Date 22/09/2010 Course 3. Nightduty 1.3 Something else here Nursegoals Interventions Record 6.2.1.3 Confusion: Observing. Nursegoals Interventions Record Course 22/09/2010 3. Nightduty Record This is a new note, i call it note 1."
results = re.findall(rx, s, re.DOTALL)
for x in results:
    print(x)
Output:
22/09/2010 1. Early duty Josephine, Jansen 22-09-2010 10:37:08 Date 22/09/2010 Duty 1. Early duty 1.3 Here there can be some other related stuff Nursegoals Interventions Record This is now the fourth note. 6.2.1.3 Confusion: Observing. Nursegoals Interventions Record This is a new, note (again), i call it note 3.
22/09/2010 1. Early duty Record This is again a note, i call it note 2. Apple: 0/less
22/09/2010 3. Nightduty Josephine, Jansen 22-09-2010 06:22:25 Date 22/09/2010
3. Nightduty 1.3 Something else here Nursegoals Interventions Record 6.2.1.3 Confusion: Observing. Nursegoals Interventions Record
22/09/2010 3. Nightduty Record This is a new note, i call it note 1.
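For the second step described in the question (collecting the notes per date part), the same lookahead idea could be reused with Record as the delimiter. This is only a sketch under assumptions: the function name notes_per_section, the shortened sample text, and the rule that a note runs until the next Record or the end of the section are all made up here. With the real data, a note would still contain any trailing headings such as "6.2.1.3 Confusion: Observing.", so the boundary will likely need refining.

```python
import re

# Step 1: sections start at "Course" and run to the next "Course" or the end.
section_rx = re.compile(r"Course\s+(.*?)(?=Course|$)", re.DOTALL)
# Step 2 (assumption): a note starts after "Record" and runs to the next
# "Record" or the end of the section.
note_rx = re.compile(r"Record\s+(.*?)(?=Record|$)", re.DOTALL)

def notes_per_section(text):
    """Return a list of (header, [notes]) tuples, one per section."""
    result = []
    for section in section_rx.findall(text):
        # Everything before the first "Record" is treated as the header.
        header = section.split("Record", 1)[0].strip()
        # Drop empty notes (a "Record" with no text after it).
        notes = [n.strip() for n in note_rx.findall(section) if n.strip()]
        result.append((header, notes))
    return result

# Hypothetical, shortened sample in the shape of the question's text.
sample = (
    "Course 22/09/2010 1. Early duty Record first note. Record second note. "
    "Course 22/09/2010 3. Nightduty Record "
    "Course 22/09/2010 3. Nightduty Record third note."
)

for header, notes in notes_per_section(sample):
    print(header, notes)
```

Sections whose Record part is empty come back with an empty list, which covers the case where the note part contains no text.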