Get the start and end position of found named entities

I'm new to both ML and spaCy. I'm trying to display the named entities found in an input text.

Here is my method:

import spacy
from collections import defaultdict
from pprint import pprint

def run():
    nlp = spacy.load('en_core_web_sm')
    sentence = "Hi my name is Oliver!"
    doc = nlp(sentence)

    # Threshold for the confidence scores.
    threshold = 0.2
    beams = nlp.entity.beam_parse(
        [doc], beam_width=16, beam_density=0.0001)

    entity_scores = defaultdict(float)
    for beam in beams:
        for score, ents in nlp.entity.moves.get_beam_parses(beam):
            for start, end, label in ents:
                entity_scores[(start, end, label)] += score

    #Create a dict to store output.
    ners = defaultdict(list)
    ners['text'] = str(sentence)

    for key in entity_scores:
        start, end, label = key
        score = entity_scores[key]
        if (score > threshold):
            ners['extractions'].append({
                "label": str(label),
                "text": str(doc[start:end]),
                "confidence": round(score, 2)
            })

    pprint(ners)

The method above works fine and prints something like this:

'extractions': [{'confidence': 1.0,
                'label': 'PERSON',
                'text': 'Oliver'}],
'text': 'Hi my name is Oliver!'})

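So far so good. Now I'm trying to get the actual position of the named entity that was found, in this case "Oliver".

Looking at the documentation, ent.start_char and ent.end_char are available, but if I use them like this: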

"start_position": doc.start_char,
"end_position": doc.end_char

I get the following error:

AttributeError: 'spacy.tokens.doc.Doc' object has no attribute 'start_char'

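Can someone point me in the right direction?

So I actually found the answer right after posting this question (typical).

I found that I didn't need to store the information in entity_scores; I could simply iterate over the entities that were actually found.

I ended up adding for ent in doc.ents:, which gives me access to all of the standard spaCy attributes. See below: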

ners = defaultdict(list)
ners['text'] = str(sentence)
for beam in beams:
    for score, ents in nlp.entity.moves.get_beam_parses(beam):
        for ent in doc.ents:
            if (score > threshold):
                ners['extractions'].append({
                    "label": str(ent.label_),
                    "text": str(ent.text),
                    "confidence": round(score, 2),
                    "start_position": ent.start_char,
                    "end_position": ent.end_char
                })

My whole method ended up looking like this:

import spacy
from collections import defaultdict
from pprint import pprint

def run():
    nlp = spacy.load('en_core_web_sm')
    sentence = "Hi my name is Oliver!"
    doc = nlp(sentence)

    # Threshold for the confidence scores.
    threshold = 0.2
    beams = nlp.entity.beam_parse(
        [doc], beam_width=16, beam_density=0.0001)

    # Create a dict to store output.
    ners = defaultdict(list)
    ners['text'] = str(sentence)
    for beam in beams:
        for score, ents in nlp.entity.moves.get_beam_parses(beam):
            for ent in doc.ents:
                if (score > threshold):
                    ners['extractions'].append({
                        "label": str(ent.label_),
                        "text": str(ent.text),
                        "confidence": round(score, 2),
                        "start_position": ent.start_char,
                        "end_position": ent.end_char
                    })

    pprint(ners)
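
One thing to watch in the loop above: score belongs to a whole beam parse, while ent comes from doc.ents, so every entity receives that parse's score, and each parse above the threshold appends all entities again. A minimal sketch of an alternative, assuming the per-entity scores are accumulated in entity_scores as in the question: slice the Doc with the token indices to get a Span, then read start_char and end_char from it. The helper name extractions_with_offsets is hypothetical, and the demo uses a blank pipeline with made-up scores so that it runs without a trained model.

```python
from collections import defaultdict

import spacy


def extractions_with_offsets(doc, entity_scores, threshold=0.2):
    """Turn {(start, end, label): score} into dicts with character offsets.

    `start` and `end` are token indices, as yielded by get_beam_parses.
    (Hypothetical helper, not part of spaCy.)
    """
    results = []
    for (start, end, label), score in entity_scores.items():
        if score > threshold:
            span = doc[start:end]  # slicing a Doc yields a Span
            results.append({
                "label": str(label),
                "text": span.text,
                "confidence": round(score, 2),
                "start_position": span.start_char,
                "end_position": span.end_char,
            })
    return results


# Demo without a trained model: blank tokenizer plus hand-written scores.
# With a real pipeline, entity_scores would be accumulated from the beam
# parses exactly as in the question.
nlp = spacy.blank("en")
doc = nlp("Hi my name is Oliver!")
entity_scores = defaultdict(float)
entity_scores[(4, 5, "PERSON")] += 1.0  # made-up score for illustration
entity_scores[(0, 1, "ORG")] += 0.05    # made-up, falls below the threshold

result = extractions_with_offsets(doc, entity_scores)
print(result)
```

Because each (start, end, label) key appears once in entity_scores, each entity is reported once, with its own accumulated score.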

If anyone arrives here looking for a simple answer to the question, I believe this should do it:

import spacy

nlp = spacy.load('en_core_web_sm')
sentence = "Hi my name is Oliver!"
doc = nlp(sentence)

for ent in doc.ents:
    print(f"Entity {ent} found with start at {ent.start_char} and end at {ent.end_char}")
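
For context, the AttributeError in the question comes from reading start_char on the Doc itself: the character offsets are attributes of Span objects (each item of doc.ents), not of the Doc. The sketch below illustrates this under a few assumptions: it uses a blank English pipeline (tokenizer only) so that no model download is needed, and the "Oliver" entity is set by hand rather than predicted.

```python
import spacy
from spacy.tokens import Span

# Assumption: a blank English pipeline (tokenizer only); the entity below is
# assigned manually for the demo instead of being predicted by a model.
nlp = spacy.blank("en")
sentence = "Hi my name is Oliver!"
doc = nlp(sentence)

# Token indices 4..5 cover "Oliver"; the label is chosen by hand.
doc.ents = [Span(doc, 4, 5, label="PERSON")]

# A Doc has no start_char attribute; a Span (and so each entity) does.
doc_has_offsets = hasattr(doc, "start_char")

offsets = [(ent.text, ent.label_, ent.start_char, ent.end_char) for ent in doc.ents]
for text, label, start_char, end_char in offsets:
    # Slicing the original string with the offsets gives back the entity text.
    print(text, label, start_char, end_char, sentence[start_char:end_char])
```

The offsets are character positions in the original string, with end_char exclusive, so sentence[ent.start_char:ent.end_char] reproduces ent.text.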