How to reconstruct text entities with Hugging Face's transformers pipelines without IOB tags?
I've been looking to use Hugging Face's pipelines for NER (named entity recognition). However, it returns the entity labels in inside-outside-beginning (IOB) format but without the IOB labels, so I'm not able to map the output of the pipeline back to my original text. Moreover, the output is split in BERT tokenization format (the default model is BERT-large).
For example:
from transformers import pipeline
nlp_bert_lg = pipeline('ner')
print(nlp_bert_lg('Hugging Face is a French company based in New York.'))
The output is:
[{'word': 'Hu', 'score': 0.9968873858451843, 'entity': 'I-ORG'},
{'word': '##gging', 'score': 0.9329522848129272, 'entity': 'I-ORG'},
{'word': 'Face', 'score': 0.9781811237335205, 'entity': 'I-ORG'},
{'word': 'French', 'score': 0.9981815814971924, 'entity': 'I-MISC'},
{'word': 'New', 'score': 0.9987512826919556, 'entity': 'I-LOC'},
{'word': 'York', 'score': 0.9976728558540344, 'entity': 'I-LOC'}]
As you can see, New York is split across two tags.
How can I map Hugging Face's NER pipeline back to my original text?
Transformers version: 2.7
Unfortunately, as of now (version 2.6, and I think even with 2.7), you cannot do that with the pipeline feature alone. Since the __call__ function invoked by the pipeline simply returns a list, see the code here. This means you'd have to do a second tokenization step with an "external" tokenizer, which defies the purpose of the pipelines altogether.
Instead, you can use the second example posted on the documentation, just below the sample similar to yours. For the sake of future completeness, here is the code:
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch
# Token-classification model fine-tuned for NER on CoNLL-2003 English
model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
label_list = [
"O", # Outside of a named entity
"B-MISC", # Beginning of a miscellaneous entity right after another miscellaneous entity
"I-MISC", # Miscellaneous entity
"B-PER", # Beginning of a person's name right after another person's name
"I-PER", # Person's name
"B-ORG", # Beginning of an organisation right after another organisation
"I-ORG", # Organisation
"B-LOC", # Beginning of a location right after another location
"I-LOC" # Location
]
sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very" \
"close to the Manhattan Bridge."
# Bit of a hack to get the tokens with the special tokens
tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sequence)))
inputs = tokenizer.encode(sequence, return_tensors="pt")
outputs = model(inputs)[0]  # logits of shape (batch, sequence_length, num_labels)
predictions = torch.argmax(outputs, dim=2)  # most likely label id for each token
print([(token, label_list[prediction]) for token, prediction in zip(tokens, predictions[0].tolist())])
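For reference, the printed token/label pairs look roughly like this (abbreviated; special tokens such as [CLS] and [SEP] are typically predicted as O):
[('[CLS]', 'O'), ('Hu', 'I-ORG'), ('##gging', 'I-ORG'), ('Face', 'I-ORG'), ('Inc', 'I-ORG'), ('.', 'O'), ..., ('New', 'I-LOC'), ('York', 'I-LOC'), ('City', 'I-LOC'), ...]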
This returns exactly what you are looking for. Note that the CoNLL annotation scheme lists the following in its original paper:
Each line contains four fields: the word, its part-of-speech tag, its chunk tag and its named entity tag. Words tagged with O are outside of named entities and the I-XXX tag is used for words inside a named entity of type XXX. Whenever two entities of type XXX are immediately next to each other, the first word of the second entity will be tagged B-XXX in order to show that it starts another entity. The data contains entities of four types: persons (PER), organizations (ORG), locations (LOC) and miscellaneous names (MISC). This tagging scheme is the IOB scheme originally put forward by Ramshaw and Marcus (1995).
Meaning: if you're not happy with the (still split) entities, you can concatenate all subsequent I- tagged tokens, or a B- tag followed by I- tags. It is not possible in this scheme that two different (immediately adjacent) entities are both tagged with only the I- tags.
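As an illustration, here is a minimal sketch of such a merge, operating on the (token, label) pairs produced by the snippet above. The group_entities helper is my own, not part of transformers, and the sample output is approximate:
def group_entities(token_label_pairs):
    """Merge IOB-tagged WordPiece tokens into (entity_text, entity_type) tuples."""
    entities = []
    current_tokens, current_type = [], None

    def flush():
        if current_tokens:
            # Re-attach WordPiece continuations ("##gging") to their head token
            entities.append((" ".join(current_tokens).replace(" ##", ""), current_type))

    for token, label in token_label_pairs:
        if label == "O":  # outside any entity; special tokens usually land here too
            flush()
            current_tokens, current_type = [], None
            continue
        prefix, ent_type = label.split("-", 1)
        # B- always opens a new entity; an I- with a different type opens one as well
        if prefix == "B" or ent_type != current_type:
            flush()
            current_tokens, current_type = [token], ent_type
        else:
            current_tokens.append(token)
    flush()
    return entities

pairs = [(token, label_list[p]) for token, p in zip(tokens, predictions[0].tolist())]
print(group_entities(pairs))
# roughly: [('Hugging Face Inc', 'ORG'), ('New York City', 'LOC'), ('DUMBO', 'LOC'), ('Manhattan Bridge', 'LOC')]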
On May 17, a new pull request https://github.com/huggingface/transformers/pull/3957 was merged that does exactly what you asked for, so life is easier now: in the pipeline you can use
ner = pipeline('ner', grouped_entities=True)
and your output will be as expected. For now you have to install from the master branch, since there is no new release yet. You can do that via
pip install git+git://github.com/huggingface/transformers.git@48c3a70b4eaedab1dd9ad49990cfaa4d6cb8f6a0
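With grouping enabled, the word-pieces come back merged into whole entities. A quick sketch of what that looks like (output abbreviated; exact scores omitted):
from transformers import pipeline
ner = pipeline('ner', grouped_entities=True)
print(ner('Hugging Face is a French company based in New York.'))
# Each dict now covers a whole entity, e.g.
# {'entity_group': 'ORG', 'word': 'Hugging Face', 'score': ...}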
If you are seeing this in 2022: the grouped_entities keyword is now deprecated. You should use aggregation_strategy instead: the default is None, and you are looking for simple or first or average or max -> see the documentation of the AggregationStrategy class.
from transformers import pipeline
import pandas as pd
text = 'Hugging Face is a French company based in New York.'
# aggregation_strategy='simple' groups the word-pieces back into whole entities
tagger = pipeline(task='ner', aggregation_strategy='simple')
named_ents = tagger(text)
pd.DataFrame(named_ents)
named_ents itself is a list of dicts with character offsets into the original text:
[{'entity_group': 'ORG',
'score': 0.96934015,
'word': 'Hugging Face',
'start': 0,
'end': 12},
{'entity_group': 'MISC',
'score': 0.9981816,
'word': 'French',
'start': 18,
'end': 24},
{'entity_group': 'LOC',
'score': 0.9982121,
'word': 'New York',
'start': 42,
'end': 50}]
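Since each group carries start/end character offsets into the original string, mapping the predictions back to the original text (the point of the question) is just a slice. A minimal sketch reusing text and named_ents from above:
for ent in named_ents:
    span = text[ent['start']:ent['end']]  # slice the original string by offsets
    print(f"{span!r} -> {ent['entity_group']} (score {ent['score']:.3f})")
# 'Hugging Face' -> ORG (score 0.969)
# 'French' -> MISC (score 0.998)
# 'New York' -> LOC (score 0.998)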