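How to add information found with regular expressions to a CSV file
I am trying to "append" new information to a CSV file. The problem is that this information is not in a dataframe structure; it is extracted from a text using regular expressions. A sample text would be the following: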
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam id diam
posuere, eleifend diam at, condimentum justo. Pellentesque mollis a
diam id consequat.
TITLE-SDFSD-DFDS-SFDS-01-01: This is the title 1 that
is split into two lines with a blank line in the middle
Conditions Pellentesque blandit scelerisque pellentesque. Sed nec quam
purus. Quisque nec tellus sed neque accumsan lacinia sit amet sit amet
tellus. Etiam venenatis nibh vel pellentesque elementum. Nullam eget
tortor quam. Morbi sed leo et arcu aliquet luctus.
Opening date 15 Apr 2021
Deadline 26 Aug 2021
Indicative budget: The total indicative budget for the topic is EUR
20.00 million.
TITLE-SDFSD-DFDS-SFDS-01-02; This is the title2 in one single line
Conditions Cras egestas consectetur sapien at dignissim. Maecenas
commodo purus nibh, a tempus augue vestibulum feugiat. Vestibulum
dolor neque, sagittis ut tortor et, lobortis faucibus quam.
Opening date 15 March 2021
Deadline 17 Aug 2021
Indicative budget: The total indicative budget for the topic is EUR
15.00 million.
TITLE-SDFSD-DFDS-SFDS-01-03: This is the title3 that is too long and takes
two lines
Conditions Cras egestas consectetur sapien at dignissim. Maecenas
commodo purus nibh, a tempus augue vestibulum feugiat. Vestibulum
dolor neque, sagittis ut tortor et, lobortis faucibus quam.
Opening date 15 May 2021
Deadline 26 Sep 2021
Indicative budget: The total indicative budget for the topic is EUR
5.00 million.
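To extract everything I need, I have to make several iterations over the text. I know that a single iteration could be split into the groups I need, but I am having a hard time finding a regex that works. Instead, I am using several of them: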
import re
import csv
with open('doubt2.txt','r', encoding="utf-8") as f:
    f_contents = f.read()
regexHOR =r'\n(TITLE-\S+-\d{2}-\d{2})[:|;](.*?)^Conditions'
regexOD = r'^Opening date\s+(\d{1,2} \w+ \d{4})\s*?'
regexDL =r'^Deadline\s+(\d+ \w+ \d+)'
patternHOR = re.compile(regexHOR, re.MULTILINE | re.DOTALL)
patternOD = re.compile(regexOD, re.MULTILINE | re.DOTALL)
patternDL = re.compile(regexDL, re.MULTILINE | re.DOTALL)
matchesHOR = patternHOR.finditer(f_contents)
matchesOD = patternOD.finditer(f_contents)
matchesDL = patternDL.finditer(f_contents)
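matchesHOR finds two groups, while the other matches find only one. Once the matches are found, I have to export them to a CSV file, running the following code: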
with open("result.csv", "w", newline='') as outfile:
    csvfile = csv.writer(outfile)
    csvfile.writerow(['Topic ID', 'Title', 'Opening date', 'Deadline'])
    for match in matchesHOR:
        csvfile.writerow([match.group(1), match.group(2).replace('\n', ' '), '', ''])
    for match in matchesOD:
        csvfile.writerow(['', '', match.group(1), ''])
    for match in matchesDL:
        csvfile.writerow(['', '', '', match.group(1)])
The problem is that the rows I write after the matchesHOR loop are placed below the earlier ones instead of next to them, as you can see in this table:
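| CODE | TITLE | Opening | Deadline |
|---|---|---|---|
| CODE 1 | TITLE 1 | | |
| CODE 2 | TITLE 2 | | |
| CODE 3 | TITLE 3 | | |
| | | OPENING 1 | |
| | | OPENING 2 | |
| | | OPENING 3 | |
| | | | DEADLINE 1 |
| | | | DEADLINE 2 |
| | | | DEADLINE 3 |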
Any additional comments on how to identify the several groups in a single iteration are also welcome.
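You need to rearrange things a bit so that all the items for a row are written at the same time. The approach here is to use match_hor to find the start of each title, then use the end of that match as the starting point for match_od, which in turn gives the starting point for match_dl.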
import re
import csv

with open('doubt2.txt', 'r', encoding="utf-8") as f:
    f_contents = f.read()

regexHOR = r'\n(TITLE-\S+-\d{2}-\d{2})[:|;](.*?)^Conditions'
regexOD = r'^Opening date\s+(\d{1,2} \w+ \d{4})\s*?'
regexDL = r'^Deadline\s+(\d+ \w+ \d+)'

patternHOR = re.compile(regexHOR, re.MULTILINE | re.DOTALL)
patternOD = re.compile(regexOD, re.MULTILINE | re.DOTALL)
patternDL = re.compile(regexDL, re.MULTILINE | re.DOTALL)

with open("result.csv", "w", newline='') as outfile:
    csvfile = csv.writer(outfile)
    csvfile.writerow(['Topic ID', 'Title', 'Opening date', 'Deadline'])

    for match_hor in patternHOR.finditer(f_contents):
        code, title = [match_hor.group(1), match_hor.group(2).replace('\n', ' ')]
        # Look for the opening date only in the text after this title.
        offset = match_hor.end()
        match_od = patternOD.search(f_contents[offset:])
        # match_od.end() is relative to the slice, so add it to the offset.
        offset += match_od.end()
        opening = match_od.group(1)
        # Look for the deadline only in the text after the opening date.
        match_dl = patternDL.search(f_contents[offset:])
        offset += match_dl.end()
        deadline = match_dl.group(1)
        # Write all four fields of one topic as a single row.
        csvfile.writerow([code, title.strip(), opening, deadline])
This gives you a result.csv containing:
Topic ID,Title,Opening date,Deadline
TITLE-SDFSD-DFDS-SFDS-01-01,This is the title 1 that is split into two lines with a blank line in the middle,15 Apr 2021,26 Aug 2021
TITLE-SDFSD-DFDS-SFDS-01-02,This is the title2 in one single line,15 March 2021,17 Aug 2021
TITLE-SDFSD-DFDS-SFDS-01-03,This is the title3 that is too long and takes two lines,15 May 2021,26 Sep 2021
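The same chaining can also be done without slicing, since a compiled pattern's search() accepts a start position. The following is only a sketch under some assumptions: the sample text is a shortened stand-in for doubt2.txt, the names SAMPLE, extract_rows and to_csv are made up for illustration, and a missing date is written as an empty field instead of raising an error.

```python
import csv
import io
import re

# Shortened, hypothetical stand-in for the contents of doubt2.txt.
SAMPLE = """Intro text.
TITLE-AB-01-01: First title that
spans two lines
Conditions Some conditions text.
Opening date 15 Apr 2021
Deadline 26 Aug 2021
TITLE-AB-01-02; Second title
Conditions More text.
Opening date 15 March 2021
Deadline 17 Aug 2021
"""

PATTERN_HOR = re.compile(r'^(TITLE-\S+-\d{2}-\d{2})[:;](.*?)^Conditions',
                         re.MULTILINE | re.DOTALL)
PATTERN_OD = re.compile(r'^Opening date\s+(\d{1,2} \w+ \d{4})', re.MULTILINE)
PATTERN_DL = re.compile(r'^Deadline\s+(\d{1,2} \w+ \d{4})', re.MULTILINE)


def extract_rows(text):
    """Return one [code, title, opening, deadline] list per topic."""
    rows = []
    for match_hor in PATTERN_HOR.finditer(text):
        code = match_hor.group(1)
        # Collapse newlines and repeated spaces inside the title.
        title = " ".join(match_hor.group(2).split())
        # Start each search where the previous match ended.
        match_od = PATTERN_OD.search(text, match_hor.end())
        match_dl = PATTERN_DL.search(text, match_od.end()) if match_od else None
        rows.append([
            code,
            title,
            match_od.group(1) if match_od else "",
            match_dl.group(1) if match_dl else "",
        ])
    return rows


def to_csv(rows):
    """Write the header and rows to an in-memory CSV and return its text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(['Topic ID', 'Title', 'Opening date', 'Deadline'])
    writer.writerows(rows)
    return buffer.getvalue()


rows = extract_rows(SAMPLE)
print(to_csv(rows))
```

One caveat: if a topic has no opening date of its own, the search runs on into the next topic and picks up that one's date, so this is only safe when every topic has both fields.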
I suggest the following code, which uses positive lookahead, lookbehind and named groups:
>>> regexHOR = r'(?P<TopicID>TITLE-\S+-\d{2}-\d{2})[:;]\s*(?P<Title>[\w\s]+(?=Conditions))'
>>> regexOD = r'(?P<OpeningDate>(?<=Opening date )\d{1,2} \w+ \d{4})'
>>> regexDL = r'(?P<DeadLine>(?<=Deadline )\d+ \w+ \d+)'
>>> regex_pattern = re.compile('.*?'.join([regexHOR, regexOD, regexDL]), re.MULTILINE | re.DOTALL)
>>> for match in re.finditer(regex_pattern, f_contents):
...     csvfile.writerow([match.group('TopicID'), match.group('Title'),
...                       match.group('OpeningDate'), match.group('DeadLine')])
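To try the combined pattern on its own, here is a small self-contained sketch. It reuses the three expressions above unchanged; the shortened sample text and the names SAMPLE, RECORD and parse_records are assumptions made for illustration, and the whitespace clean-up of the title is an addition that the snippet above does not do.

```python
import re

# Shortened, hypothetical stand-in for the contents of doubt2.txt.
SAMPLE = """Intro text.
TITLE-AB-01-01: First title that
spans two lines
Conditions Some conditions text.
Opening date 15 Apr 2021
Deadline 26 Aug 2021
TITLE-AB-01-02; Second title
Conditions More text.
Opening date 15 March 2021
Deadline 17 Aug 2021
"""

REGEX_HOR = r'(?P<TopicID>TITLE-\S+-\d{2}-\d{2})[:;]\s*(?P<Title>[\w\s]+(?=Conditions))'
REGEX_OD = r'(?P<OpeningDate>(?<=Opening date )\d{1,2} \w+ \d{4})'
REGEX_DL = r'(?P<DeadLine>(?<=Deadline )\d+ \w+ \d+)'

# Join the three pieces with a lazy ".*?" so one match covers one topic.
RECORD = re.compile('.*?'.join([REGEX_HOR, REGEX_OD, REGEX_DL]), re.DOTALL)


def parse_records(text):
    """Return one dict per topic, keyed by the named groups."""
    records = []
    for match in RECORD.finditer(text):
        record = match.groupdict()
        # The title group keeps its line breaks; collapse them to spaces.
        record['Title'] = ' '.join(record['Title'].split())
        records.append(record)
    return records


for record in parse_records(SAMPLE):
    print(record)
```

Note that the Title group only accepts word characters and whitespace, so a title containing punctuation such as a hyphen or a comma would not match with this pattern as written.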
Every call to csvfile.writerow writes a new row. That is why your three separate loops do not put the items belonging to one topic on the same row.
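If you want to keep three separate patterns, one way around this is to collect the matches first and pair them up by position before writing, so that writerow is called once per topic. This is a rough sketch, not a robust parser: the sample text and the names SAMPLE and rows_from_parallel_matches are hypothetical, and it assumes every topic has exactly one opening date and one deadline, because zip pairs items purely by order.

```python
import csv
import io
import re

# Shortened, hypothetical stand-in for the contents of doubt2.txt.
SAMPLE = """Intro text.
TITLE-AB-01-01: First title that
spans two lines
Conditions Some conditions text.
Opening date 15 Apr 2021
Deadline 26 Aug 2021
TITLE-AB-01-02; Second title
Conditions More text.
Opening date 15 March 2021
Deadline 17 Aug 2021
"""


def rows_from_parallel_matches(text):
    """Pair the n-th title with the n-th opening date and n-th deadline."""
    # With two groups, findall returns a list of (code, title) tuples.
    codes_titles = re.findall(r'^(TITLE-\S+-\d{2}-\d{2})[:;](.*?)^Conditions',
                              text, re.MULTILINE | re.DOTALL)
    # With one group, findall returns a list of strings.
    openings = re.findall(r'^Opening date\s+(\d{1,2} \w+ \d{4})', text, re.MULTILINE)
    deadlines = re.findall(r'^Deadline\s+(\d{1,2} \w+ \d{4})', text, re.MULTILINE)
    return [
        [code, ' '.join(title.split()), opening, deadline]
        for (code, title), opening, deadline in zip(codes_titles, openings, deadlines)
    ]


buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['Topic ID', 'Title', 'Opening date', 'Deadline'])
# One writerow call per topic, each with all four fields.
for row in rows_from_parallel_matches(SAMPLE):
    writer.writerow(row)
print(buffer.getvalue())
```

If a topic is missing one of the dates, the lists fall out of step and later rows are silently misaligned, so searching from the end of each title match is the safer choice for irregular input.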