NLTK Named Entity recognition to a Python list
I am using NLTK's ne_chunk to extract named entities from a text:
my_sent = "WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."
nltk.ne_chunk(my_sent, binary=True)
But I can't figure out how to save these entities to a list, e.g.:
print Entity_list
('WASHINGTON', 'New York', 'Loretta', 'Brooklyn', 'African')
Thanks.
When you get a tree as the return value, I guess you want to pick out the subtrees labeled NE.
Here is a simple example to gather all of them into a list:
import nltk
my_sent = "WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."
parse_tree = nltk.ne_chunk(nltk.tag.pos_tag(my_sent.split()), binary=True) # POS tagging before chunking!
named_entities = []
for t in parse_tree.subtrees():
    if t.label() == 'NE':
        named_entities.append(t)
        # named_entities.append(list(t))  # if you want a list of tagged words instead of a tree
print(named_entities)
This gives:
[Tree('NE', [('WASHINGTON', 'NNP')]), Tree('NE', [('New', 'NNP'), ('York', 'NNP')])]
Or as a list of lists:
[[('WASHINGTON', 'NNP')], [('New', 'NNP'), ('York', 'NNP')]]
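If you just want the entity strings rather than tagged subtrees, a minimal follow-up sketch (my addition, assuming the parse_tree from the snippet above) could join the leaves of each NE subtree:
entity_strings = [" ".join(word for word, tag in subtree.leaves())
                  for subtree in parse_tree.subtrees()
                  if subtree.label() == 'NE']
print(entity_strings)  # e.g. ['WASHINGTON', 'New York', ...]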
See also: How to navigate a nltk.tree.Tree?
nltk.ne_chunk returns a nested nltk.tree.Tree object, so you would have to traverse the Tree object to get to the NEs.
Take a look at Named Entity Recognition with Regular Expression: NLTK
>>> from nltk import ne_chunk, pos_tag, word_tokenize
>>> from nltk.tree import Tree
>>>
>>> def get_continuous_chunks(text):
...     chunked = ne_chunk(pos_tag(word_tokenize(text)))
...     continuous_chunk = []
...     current_chunk = []
...     for i in chunked:
...         if type(i) == Tree:
...             current_chunk.append(" ".join([token for token, pos in i.leaves()]))
...         if current_chunk:
...             named_entity = " ".join(current_chunk)
...             if named_entity not in continuous_chunk:
...                 continuous_chunk.append(named_entity)
...                 current_chunk = []
...         else:
...             continue
...     return continuous_chunk
...
>>> my_sent = "WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."
>>> get_continuous_chunks(my_sent)
['WASHINGTON', 'New York', 'Loretta E. Lynch', 'Brooklyn']
>>> my_sent = "How's the weather in New York and Brooklyn"
>>> get_continuous_chunks(my_sent)
['New York', 'Brooklyn']
A Tree is a list. The chunks are subtrees, and the non-chunk words are regular strings. So let's go down the list, extract the words from each chunk, and join them together.
>>> chunked = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(my_sent)))
>>>
>>> [ " ".join(w for w, t in elt) for elt in chunked if isinstance(elt, nltk.Tree) ]
['WASHINGTON', 'New York', 'Loretta E. Lynch', 'Brooklyn']
You can also extract the label of each named entity in the text using the following code:
import nltk
for sent in nltk.sent_tokenize(sentence):
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
        if hasattr(chunk, 'label'):
            print(chunk.label(), ' '.join(c[0] for c in chunk))
Output:
GPE WASHINGTON
GPE New York
PERSON Loretta E. Lynch
GPE Brooklyn
You can see that Washington, New York and Brooklyn are GPE (which means geo-political entity), and Loretta E. Lynch is a PERSON.
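Building on the loop above, a small sketch of my own (not part of the original answer) that groups the entities by label into a dictionary, assuming sentence holds the text:
import nltk
from collections import defaultdict

entities_by_label = defaultdict(list)
for sent in nltk.sent_tokenize(sentence):
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
        if hasattr(chunk, 'label'):
            # e.g. entities_by_label['GPE'] collects 'WASHINGTON', 'New York', 'Brooklyn'
            entities_by_label[chunk.label()].append(' '.join(c[0] for c in chunk))
print(dict(entities_by_label))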
Use tree2conlltags from nltk.chunk. Also, ne_chunk needs POS tagging, which tags word tokens (hence word_tokenize is needed).
from nltk import word_tokenize, pos_tag, ne_chunk
from nltk.chunk import tree2conlltags
sentence = "Mark and John are working at Google."
print(tree2conlltags(ne_chunk(pos_tag(word_tokenize(sentence)))))
"""[('Mark', 'NNP', 'B-PERSON'),
('and', 'CC', 'O'), ('John', 'NNP', 'B-PERSON'),
('are', 'VBP', 'O'), ('working', 'VBG', 'O'),
('at', 'IN', 'O'), ('Google', 'NNP', 'B-ORGANIZATION'),
('.', '.', 'O')] """
This will give you a list of tuples: [(token, pos_tag, name_entity_tag)].
If this list is not exactly what you want, it is certainly easier to parse the list you want from it than from an nltk tree.
Code and details from this link; check it for further information.
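For instance, a rough sketch of my own (not from the linked answer) that merges consecutive B-/I- tags from the tree2conlltags output back into (entity, label) pairs could look like this:
from nltk import word_tokenize, pos_tag, ne_chunk
from nltk.chunk import tree2conlltags

def iob_to_entities(iob_triples):
    # Rebuild multi-token entities from [(token, pos, iob_tag), ...] triples.
    entities, tokens, label = [], [], None
    for token, pos, ne in iob_triples:
        if ne.startswith('B-'):                 # a new entity starts
            if tokens:
                entities.append((' '.join(tokens), label))
            tokens, label = [token], ne[2:]
        elif ne.startswith('I-') and tokens:    # the current entity continues
            tokens.append(token)
        else:                                   # an 'O' tag closes any open entity
            if tokens:
                entities.append((' '.join(tokens), label))
            tokens, label = [], None
    if tokens:
        entities.append((' '.join(tokens), label))
    return entities

sentence = "Mark and John are working at Google."
print(iob_to_entities(tree2conlltags(ne_chunk(pos_tag(word_tokenize(sentence))))))
# e.g. [('Mark', 'PERSON'), ('John', 'PERSON'), ('Google', 'ORGANIZATION')]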
You can also continue by extracting only the words, with the following function:
def wordextractor(tuple1):
    # bring the tuple back to lists to work with it
    words, tags, pos = zip(*tuple1)
    words = list(words)
    pos = list(pos)
    c = list()
    i = 0
    while i <= len(tuple1) - 1:
        # get words which have the tag B-PERSON or I-PERSON
        if pos[i] == 'B-PERSON':
            c = c + [words[i]]
        elif pos[i] == 'I-PERSON':
            c = c + [words[i]]
        i = i + 1
    return c
print(wordextractor(tree2conlltags(nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence))))))
Edit: Added the output docstring.
Edit: Added output for B-PERSON only.
You may also consider using Spacy:
import spacy
nlp = spacy.load('en')
doc = nlp('WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement.')
print([ent for ent in doc.ents])
>>> [WASHINGTON, New York, the 1990s, Loretta E. Lynch, Brooklyn, African-Americans]
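If you also want the entity types, each spaCy entity exposes .text and .label_, so a follow-up along these lines should work with the doc from above:
print([(ent.text, ent.label_) for ent in doc.ents])
# e.g. [('WASHINGTON', 'GPE'), ('New York', 'GPE'), ('Loretta E. Lynch', 'PERSON'), ...]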
nltk.ne_chunk returns a nested nltk.tree.Tree object, so you would have to traverse the Tree object to get to the NEs. You can do the same thing with a list comprehension.
import nltk
my_sent = "WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."
word = nltk.word_tokenize(my_sent)
pos_tag = nltk.pos_tag(word)
chunk = nltk.ne_chunk(pos_tag)
NE = [ " ".join(w for w, t in ele) for ele in chunk if isinstance(ele, nltk.Tree)]
print (NE)
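If you need the labels as well, a small variation of the same comprehension (my addition, assuming the chunk tree from the code above) pairs each entity string with its subtree label:
NE_with_labels = [(ele.label(), " ".join(w for w, t in ele))
                  for ele in chunk if isinstance(ele, nltk.Tree)]
print(NE_with_labels)  # e.g. [('GPE', 'WASHINGTON'), ('GPE', 'New York'), ...]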