How to reconstruct text entities with Hugging Face's transformers pipelines without IOB tags?
I've been looking to use Hugging Face's pipelines for NER (named entity recognition). However, it returns the entity labels in inside-outside-beginning (IOB) format but without the IOB labels, so I'm not able to map the output of the pipeline back to my original text. Moreover, the output is split in BERT tokenization format (the default model is BERT-large).
For example:
from transformers import pipeline
nlp_bert_lg = pipeline('ner')
print(nlp_bert_lg('Hugging Face is a French company based in New York.'))
The output is:
[{'word': 'Hu', 'score': 0.9968873858451843, 'entity': 'I-ORG'},
{'word': '##gging', 'score': 0.9329522848129272, 'entity': 'I-ORG'},
{'word': 'Face', 'score': 0.9781811237335205, 'entity': 'I-ORG'},
{'word': 'French', 'score': 0.9981815814971924, 'entity': 'I-MISC'},
{'word': 'New', 'score': 0.9987512826919556, 'entity': 'I-LOC'},
{'word': 'York', 'score': 0.9976728558540344, 'entity': 'I-LOC'}]
As you can see, New York is split across two tags.
How can I map Hugging Face's NER pipeline back to my original text?
Transformers version: 2.7
Unfortunately, as of now (version 2.6, and I think even with 2.7), you cannot do that with the pipeline feature alone. Since the __call__ function invoked by the pipeline simply returns a list, see the code here. This means you'd have to do a second tokenization step with an "external" tokenizer, which defies the purpose of the pipelines altogether.
Instead, you can use the second example posted on the documentation, just below the sample similar to yours. For the sake of future completeness, here is the code:
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch
# Token-classification model fine-tuned for NER on CoNLL-2003 English
model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
label_list = [
"O", # Outside of a named entity
"B-MISC", # Beginning of a miscellaneous entity right after another miscellaneous entity
"I-MISC", # Miscellaneous entity
"B-PER", # Beginning of a person's name right after another person's name
"I-PER", # Person's name
"B-ORG", # Beginning of an organisation right after another organisation
"I-ORG", # Organisation
"B-LOC", # Beginning of a location right after another location
"I-LOC" # Location
]
sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very" \
"close to the Manhattan Bridge."
# Bit of a hack to get the tokens with the special tokens
tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sequence)))
inputs = tokenizer.encode(sequence, return_tensors="pt")
outputs = model(inputs)[0]  # logits of shape (batch, sequence_length, num_labels)
predictions = torch.argmax(outputs, dim=2)  # most likely label id for each token
print([(token, label_list[prediction]) for token, prediction in zip(tokens, predictions[0].tolist())])
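For reference, the printed token/label pairs look roughly like this (abbreviated; special tokens such as [CLS] and [SEP] are typically predicted as O):
[('[CLS]', 'O'), ('Hu', 'I-ORG'), ('##gging', 'I-ORG'), ('Face', 'I-ORG'), ('Inc', 'I-ORG'), ('.', 'O'), ..., ('New', 'I-LOC'), ('York', 'I-LOC'), ('City', 'I-LOC'), ...]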
This returns exactly what you are looking for. Note that the CoNLL annotation scheme lists the following in its original paper:
Each line contains four fields: the word, its part-of-speech tag, its chunk tag and its named entity tag. Words tagged with O are outside of named entities and the I-XXX tag is used for words inside a named entity of type XXX. Whenever two entities of type XXX are immediately next to each other, the first word of the second entity will be tagged B-XXX in order to show that it starts another entity. The data contains entities of four types: persons (PER), organizations (ORG), locations (LOC) and miscellaneous names (MISC). This tagging scheme is the IOB scheme originally put forward by Ramshaw and Marcus (1995).
Meaning: if you're not happy with the (still split) entities, you can concatenate all subsequent I- tagged tokens, or a B- tag followed by I- tags. It is not possible in this scheme that two different (immediately adjacent) entities are both tagged with only the I- tags.
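As an illustration, here is a minimal sketch of such a merge, operating on the (token, label) pairs produced by the snippet above. The group_entities helper is my own, not part of transformers, and the sample output is approximate:
def group_entities(token_label_pairs):
    """Merge IOB-tagged WordPiece tokens into (entity_text, entity_type) tuples."""
    entities = []
    current_tokens, current_type = [], None

    def flush():
        if current_tokens:
            # Re-attach WordPiece continuations ("##gging") to their head token
            entities.append((" ".join(current_tokens).replace(" ##", ""), current_type))

    for token, label in token_label_pairs:
        if label == "O":  # outside any entity; special tokens usually land here too
            flush()
            current_tokens, current_type = [], None
            continue
        prefix, ent_type = label.split("-", 1)
        # B- always opens a new entity; an I- with a different type opens one as well
        if prefix == "B" or ent_type != current_type:
            flush()
            current_tokens, current_type = [token], ent_type
        else:
            current_tokens.append(token)
    flush()
    return entities

pairs = [(token, label_list[p]) for token, p in zip(tokens, predictions[0].tolist())]
print(group_entities(pairs))
# roughly: [('Hugging Face Inc', 'ORG'), ('New York City', 'LOC'), ('DUMBO', 'LOC'), ('Manhattan Bridge', 'LOC')]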
On May 17, a new pull request https://github.com/huggingface/transformers/pull/3957 was merged that does exactly what you asked for, so life is easier now: in the pipeline you can use
ner = pipeline('ner', grouped_entities=True)
and your output will be as expected. For now you have to install from the master branch, since there is no new release yet. You can do that via
pip install git+git://github.com/huggingface/transformers.git@48c3a70b4eaedab1dd9ad49990cfaa4d6cb8f6a0
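With grouping enabled, the word-pieces come back merged into whole entities. A quick sketch of what that looks like (output abbreviated; exact scores omitted):
from transformers import pipeline
ner = pipeline('ner', grouped_entities=True)
print(ner('Hugging Face is a French company based in New York.'))
# Each dict now covers a whole entity, e.g.
# {'entity_group': 'ORG', 'word': 'Hugging Face', 'score': ...}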
If you are seeing this in 2022: the grouped_entities keyword is now deprecated. You should use aggregation_strategy instead: the default is None, and you are looking for simple or first or average or max -> see the documentation of the AggregationStrategy class.
from transformers import pipeline
import pandas as pd
text = 'Hugging Face is a French company based in New York.'
# aggregation_strategy='simple' groups the word-pieces back into whole entities
tagger = pipeline(task='ner', aggregation_strategy='simple')
named_ents = tagger(text)
pd.DataFrame(named_ents)
named_ents itself is a list of dicts with character offsets into the original text:
[{'entity_group': 'ORG',
'score': 0.96934015,
'word': 'Hugging Face',
'start': 0,
'end': 12},
{'entity_group': 'MISC',
'score': 0.9981816,
'word': 'French',
'start': 18,
'end': 24},
{'entity_group': 'LOC',
'score': 0.9982121,
'word': 'New York',
'start': 42,
'end': 50}]
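Since each group carries start/end character offsets into the original string, mapping the predictions back to the original text (the point of the question) is just a slice. A minimal sketch reusing text and named_ents from above:
for ent in named_ents:
    span = text[ent['start']:ent['end']]  # slice the original string by offsets
    print(f"{span!r} -> {ent['entity_group']} (score {ent['score']:.3f})")
# 'Hugging Face' -> ORG (score 0.969)
# 'French' -> MISC (score 0.998)
# 'New York' -> LOC (score 0.998)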