How can I convert entities (a list) to a dictionary? My attempted code is commented out and not working — NLP problem
My attempted code is commented out below, but it doesn't work (or isn't really a conversion). How can I rewrite entities so it behaves like a dictionary? I want the result as a dictionary so I can find the 5 most frequently mentioned persons in the first 500 sentences.
! pip install wget
import wget
url = 'https://raw.githubusercontent.com/dirkhovy/NLPclass/master/data/moby_dick.txt'
wget.download(url, 'moby_dick.txt')
documents = [line.strip() for line in open('moby_dick.txt', encoding='utf8').readlines()]
import spacy
nlp = spacy.load('en')  # note: in spaCy v3+ the 'en' shortcut is gone; use spacy.load('en_core_web_sm')
entities = [[(entity.text, entity.label_) for entity in nlp(sentence).ents] for sentence in documents[:50]]
entities
# I tried this, but it's wrong:
#def Convert(lst):
# res_dct = {lst[i]: lst[i + 1] for i in range(0, len(lst), 2)}
# return res_dct
#print(Convert(ent))
The list stored in the variable entities has the type list[list[tuple[str, str]]], where the first entry of each tuple is the entity's string and the second is the entity's type, e.g.:
>>> from pprint import pprint
>>> pprint(entities)
[[],
[('Ishmael', 'GPE')],
[('Some years ago', 'DATE')],
[],
[('November', 'DATE')],
[],
[('Cato', 'ORG')],
[],
[],
[('Manhattoes', 'ORG'), ('Indian', 'NORP')],
[],
[('a few hours', 'TIME')],
...
You can then build the inverse dict (mapping each entity type to its entities) as follows:
>>> sum(filter(None, entities), [])
[('Ishmael', 'GPE'), ('Some years ago', 'DATE'), ('November', 'DATE'), ('Cato', 'ORG'), ('Manhattoes', 'ORG'), ('Indian', 'NORP'), ('a few hours', 'TIME'), ('Sabbath afternoon', 'TIME'), ('Corlears Hook to Coenties Slip', 'WORK_OF_ART'), ('Whitehall', 'PERSON'), ('thousands upon thousands', 'CARDINAL'), ('China', 'GPE'), ('week days', 'DATE'), ('ten', 'CARDINAL'), ('American', 'NORP'), ('June', 'DATE'), ('one', 'CARDINAL'), ('Niagara', 'ORG'), ('thousand miles', 'QUANTITY'), ('Tennessee', 'GPE'), ('two', 'CARDINAL'), ('Rockaway Beach', 'GPE'), ('first', 'ORDINAL'), ('first', 'ORDINAL'), ('Persians', 'NORP')]
>>> from collections import defaultdict
>>> type2entities = defaultdict(list)
>>> for entity, entity_type in sum(filter(None, entities), []):
... type2entities[entity_type].append(entity)
...
>>> from pprint import pprint
>>> pprint(type2entities)
defaultdict(<class 'list'>,
            {'CARDINAL': ['thousands upon thousands', 'ten', 'one', 'two'],
             'DATE': ['Some years ago', 'November', 'week days', 'June'],
             'GPE': ['Ishmael', 'China', 'Tennessee', 'Rockaway Beach'],
             'NORP': ['Indian', 'American', 'Persians'],
             'ORDINAL': ['first', 'first'],
             'ORG': ['Cato', 'Manhattoes', 'Niagara'],
             'PERSON': ['Whitehall'],
             'QUANTITY': ['thousand miles'],
             'TIME': ['a few hours', 'Sabbath afternoon'],
             'WORK_OF_ART': ['Corlears Hook to Coenties Slip']})
The dict stored in the variable type2entities is what you want. To get the persons mentioned most often in the first 500 lines (along with their mention counts):
>>> from collections import Counter
>>> entities = [[(entity.text, entity.label_) for entity in nlp(sentence).ents] for sentence in documents[:500]]
>>> person_cnt = Counter()
>>> for entity, entity_type in sum(filter(None, entities), []):
... if entity_type == 'PERSON':
... person_cnt[entity] += 1
...
>>> person_cnt.most_common(5)
[('Queequeg', 17), ('don', 4), ('Nantucket', 2), ('Jonah', 2), ('Sal', 2)]
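As an aside, sum(filter(None, entities), []) copies the accumulated list on every step, so it scales quadratically. itertools.chain.from_iterable flattens the nested lists in one pass (empty sub-lists simply contribute nothing), and Counter can consume the generator directly. A minimal sketch of the same person count; the sample entities list here is made up for illustration:

```python
from collections import Counter
from itertools import chain

# Hypothetical sample in the same list[list[tuple[str, str]]] shape
entities = [
    [],
    [('Ishmael', 'GPE')],
    [('Queequeg', 'PERSON'), ('Queequeg', 'PERSON')],
    [('Whitehall', 'PERSON')],
]

# Flatten lazily and count only PERSON entities in a single expression
person_cnt = Counter(
    entity
    for entity, entity_type in chain.from_iterable(entities)
    if entity_type == 'PERSON'
)
print(person_cnt.most_common(2))  # [('Queequeg', 2), ('Whitehall', 1)]
```

On the real entities list this produces the same counts as the explicit loop above, just without building the intermediate flattened list.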