Transformer Pipeline for NER returns partial words with ##s
How should I interpret the partial words with "##" returned by the Transformer NER pipeline? Other tools, such as Flair and SpaCy, return the whole word along with its tag. I have worked with the CoNLL dataset before and never noticed anything like this. Also, why is tokenization done this way?
Example from HuggingFace:
from transformers import pipeline
nlp = pipeline("ner")
sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very" \
"close to the Manhattan Bridge which is visible from the window."
print(nlp(sequence))
Output:
[
{'word': 'Hu', 'score': 0.9995632767677307, 'entity': 'I-ORG'},
{'word': '##gging', 'score': 0.9915938973426819, 'entity': 'I-ORG'},
{'word': 'Face', 'score': 0.9982671737670898, 'entity': 'I-ORG'},
{'word': 'Inc', 'score': 0.9994403719902039, 'entity': 'I-ORG'},
{'word': 'New', 'score': 0.9994346499443054, 'entity': 'I-LOC'},
{'word': 'York', 'score': 0.9993270635604858, 'entity': 'I-LOC'},
{'word': 'City', 'score': 0.9993864893913269, 'entity': 'I-LOC'},
{'word': 'D', 'score': 0.9825621843338013, 'entity': 'I-LOC'},
{'word': '##UM', 'score': 0.936983048915863, 'entity': 'I-LOC'},
{'word': '##BO', 'score': 0.8987102508544922, 'entity': 'I-LOC'},
{'word': 'Manhattan', 'score': 0.9758241176605225, 'entity': 'I-LOC'},
{'word': 'Bridge', 'score': 0.990249514579773, 'entity': 'I-LOC'}
]
PyTorch transformers and BERT produce two kinds of tokens: words that appear in the vocabulary stay as single tokens, while other words are split into a base piece plus one or more sub-word pieces, each marked with a leading "##".
Suppose you have the phrase: I like hugging animals
The first list of tokens would be:
["I", "like", "hugging", "animals"]
The second list, with sub-words, would be:
["I", "like", "hug", "##gging", "animal", "##s"]
You can learn more here:
https://www.kaggle.com/funtowiczmo/hugging-face-tutorials-training-tokenizer
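If you need whole words back from the pipeline output, the "##" pieces can be merged with a few lines of plain Python; a minimal sketch (merge_subwords is a hypothetical helper written against the list-of-dicts output shown above, and it keeps only the first piece's score for simplicity):

def merge_subwords(entities):
    # Glue '##' continuation pieces onto the preceding word;
    # the continuation's own score is dropped for simplicity.
    merged = []
    for ent in entities:
        if ent['word'].startswith('##') and merged:
            merged[-1]['word'] += ent['word'][2:]
        else:
            merged.append(dict(ent))
    return merged

print(merge_subwords(nlp(sequence)))
# e.g. 'Hu' + '##gging' -> 'Hugging', 'D' + '##UM' + '##BO' -> 'DUMBO'

Depending on your transformers version, the pipeline itself may also be able to do this grouping for you (for example via the grouped_entities flag, later superseded by aggregation_strategy in newer releases).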