Science paper information extraction with Python?
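I'm new to Python, and I need to extract some information from a few scientific papers.
Given plain text such as:
- Introduction
some long writings
- Methodology
some long writings
- Results
some long writings
how can I put a paper into a dictionary like the one below?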
paper_1 = {
'Introduction': some long writings,
'Methodology': some long writings,
'Results': some long writings
}
Many thanks :-)
After some trying, I got some code running, but it is not perfect:
text = 'introduction This is the FIRST part.' \
       'Methodologies This is the SECOND part.' \
       'results This is the THIRD part.'
import re
from re import finditer
d = {}
first = []    # start offset of each heading match
second = []   # end offset of each heading match
title_list = []
all = []
# The trailing "|" in the pattern also yields empty matches; they are skipped below.
for match in finditer("Methodology|results|methodologies|introduction|", text, re.IGNORECASE):
    if match.group() != '':
        title = match.group()
        location = match.span()
        first.append(location[0])
        second.append(location[1])
        title_list.append(title)
all.append(first)
all.append(second)
a = []
for i in range(2):
    j = i + 1
    # Body runs from the end of heading i to the start of heading i+1.
    section = text[all[1][i]:all[0][j]]
    a.append(section)
for i in zip(title_list, a):
    d[i[0]] = i[1]
print(d)
This produces the following result:
{
'introduction': ' This is the FIRST part.',
'Methodologies': ' This is the SECOND part.'
}
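However,
i) it cannot extract the last part, i.e. the Results section.
ii) In the loop I passed 2 to range() because I know there are only 3 sections (Introduction, Methodology and Results), but in some papers people add more sections. How can I automatically pass the right value to range()? For example, some papers may contain the following sections:
- Introduction
some long writings
- General background about something
some long writings
- Some kind of section title
some long writings
- Methodology
some long writings
- Results
some long writings
iii) Is there a more efficient way to build the dictionary inside a single loop, so that I don't need the second loop?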
Update 30-03-2018:
The code is updated as follows:
def section_detection(text):
    title_list = []
    all = [[], []]   # all[0]: heading start offsets, all[1]: heading end offsets
    dic = {}
    count = 0
    pattern = r'\d\. [A-Z][a-z]*'
    for match in finditer(pattern, text, re.IGNORECASE):
        if match.group() != '':
            all[0].append(match.span()[0])
            all[1].append(match.span()[1])
            title_list.append(match.group())
            count += 1
    for i in range(count):
        j = i + 1
        try:
            dic[title_list[i]] = text[all[1][i]:all[0][j]]
        except IndexError:
            # Last section: take everything up to the end of the text.
            dic[title_list[i]] = text[all[1][i]:]
    return dic
If executed as follows:
import re
from re import finditer
text = '1. introduction This is the FIRST part.' \
'2. Methodologies This is the SECOND part.' \
'3. results This is the THIRD part.'\
'4. somesection This SOME section'
dic = section_detection(text)
print(dic)
it gives:
{'1. introduction': ' This is the FIRST part.', '2. Methodologies': ' This is the SECOND part.', '3. results': ' This is the THIRD part.', '4. somesection': ' This SOME section'}
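The same idea can be written with a single pass and no try/except, by pairing each heading match with the start of the next one. This is only a sketch: the function name `section_detection_single_loop` and the default pattern (which allows multi-digit section numbers) are assumptions for illustration, not part of the code above.

```python
import re

def section_detection_single_loop(text, pattern=r'\d+\. [A-Za-z]+'):
    """Map each numbered heading to the text that follows it."""
    matches = list(re.finditer(pattern, text))
    # Each body ends where the next heading starts; the last one ends at len(text).
    next_starts = [m.start() for m in matches[1:]] + [len(text)]
    return {m.group(): text[m.end():nxt] for m, nxt in zip(matches, next_starts)}

text = '1. introduction This is the FIRST part.' \
       '2. Methodologies This is the SECOND part.' \
       '3. results This is the THIRD part.' \
       '4. somesection This SOME section'
print(section_detection_single_loop(text))
```

Because the number of sections comes from the number of matches, nothing has to be passed to range() by hand.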
Thank you all very much! :-)
Try this:
import re

text = 'introduction This is the FIRST part. ' \
       'Methodologies This is the SECOND part. ' \
       'results This is the THIRD part. '
kw = ['methodology', 'results', 'methodologies', 'introduction']
# The capturing group makes re.split keep the matched headings in the result.
pat = re.compile(r'(%s)' % '|'.join(kw), re.IGNORECASE)
sp = [x for x in re.split(pat, text) if x]
# Even positions are headings, odd positions are the section bodies.
dic = {k: v for k, v in zip(sp[0::2], sp[1::2])}
print(dic)
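But this only works for your example and will not hold up well on real-world documents. You didn't specify what text comes before "Introduction", or what should happen when someone mentions "result" in the running text.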
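One way to soften both problems, assuming each heading sits at the start of its own line (an assumption about the input, not something the question states), is to anchor the pattern with `re.MULTILINE` and to return whatever precedes the first heading separately. The helper name `split_sections` is hypothetical:

```python
import re

def split_sections(text, headings):
    """Return (preamble, {heading: body}), matching headings only at line starts."""
    alternation = '|'.join(re.escape(h) for h in sorted(headings, key=len, reverse=True))
    # ^ with re.MULTILINE matches at the start of every line, so a keyword
    # in the middle of a sentence is not treated as a heading.
    pat = re.compile(r'^[ \t]*(%s)\b[ \t]*' % alternation,
                     re.IGNORECASE | re.MULTILINE)
    parts = pat.split(text)
    preamble, rest = parts[0], parts[1:]
    sections = {}
    for title, body in zip(rest[0::2], rest[1::2]):
        sections[title.lower()] = body.strip()
    return preamble.strip(), sections

sample = ("Some Title\nJohn Doe\n"
          "Introduction\nWe report a result here.\n"
          "Methodology\nWe did things.\n"
          "Results\nIt worked.\n")
preamble, sections = split_sections(
    sample, ['methodology', 'results', 'methodologies', 'introduction'])
print(preamble)
print(sections)
```

A sentence that happens to begin a line with one of the keywords would still be split, so this reduces false matches rather than eliminating them.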
I really like the regex that @Franz Forstmayr wrote. I just want to point out a few ways to break it.
text = '''
introduction This is the FIRST part.
introductionMethodologies This is the SECOND part.
results This is the THIRD part.
'''
import re
#### Regex based on
kw = ['methodology', 'results', 'methodologies', 'introduction']
pat = re.compile(r'(%s)' % '|'.join(kw), re.IGNORECASE)
sp = [x for x in re.split(pat, text) if x]
print(sp)
dic = {k: v for k, v in zip(sp[0::2], sp[1::2])}
print(dic)
# {'\n': 'introduction',
#  ' This is the FIRST part.\n': 'introduction',
#  'Methodologies': ' This is the SECOND part.\n',
#  'results': ' This is the THIRD part.\n'}
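You can see that the list is shifted by the \n character and the dictionary is broken. So I suggest slicing at a fixed position instead of filtering out empty strings: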
out = re.split(pat, text)
lead = out[0:1]  ### Keep the lead available in case needed
sp = out[1:]
print(sp)
dic = {k: v for k, v in zip(sp[0::2], sp[1::2])}
print(dic)
# {'introduction': '',
#  'Methodologies': ' This is the SECOND part.\n',
#  'results': ' This is the THIRD part.\n'}
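Note that in the output above the body of the first "introduction" is lost, because the second, empty "introduction" overwrites the same dictionary key. If repeated headings should be kept, one option is to collect the bodies in a list per heading. This is a sketch only; `split_keep_duplicates` is a hypothetical name and lower-casing the keys is my own choice:

```python
import re
from collections import defaultdict

def split_keep_duplicates(text, keywords):
    """Return (lead, {heading: [bodies]}) so repeated headings are not overwritten."""
    pat = re.compile('(%s)' % '|'.join(re.escape(k) for k in keywords),
                     re.IGNORECASE)
    out = pat.split(text)
    # out[0] is always the text before the first heading (possibly empty).
    lead, rest = out[0], out[1:]
    sections = defaultdict(list)
    for title, body in zip(rest[0::2], rest[1::2]):
        sections[title.lower()].append(body)
    return lead, dict(sections)

text = '''
introduction This is the FIRST part.
introductionMethodologies This is the SECOND part.
results This is the THIRD part.
'''
kw = ['methodology', 'results', 'methodologies', 'introduction']
lead, sections = split_keep_duplicates(text, kw)
print(sections)
```

Whether the bodies are later joined, or only the first non-empty one is kept, depends on how the papers repeat their headings.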