Converting spaCy training data format to spaCy CLI format (for blank NER)
This is the classic training data format:
TRAIN_DATA = [
("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]
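Each entity in this format is a (start, end, label) triple of character offsets into the text, with the end offset exclusive. As a quick sanity check, here is a minimal sketch that slices the text with those offsets; `entity_texts` is a hypothetical helper name used only for illustration.

```python
TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]


def entity_texts(train_data):
    """Return (label, surface text) for every annotated entity."""
    found = []
    for text, annot in train_data:
        for start, end, label in annot["entities"]:
            # The offsets are plain Python slice indices into the raw text.
            found.append((label, text[start:end]))
    return found


for label, surface in entity_texts(TRAIN_DATA):
    print(label, repr(surface))
# PERSON 'Shaka Khan'
# LOC 'London'
# LOC 'Berlin'
```

If a slice shows stray whitespace or a cut-off word, the offsets for that example probably need fixing before conversion.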
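I used to train in code, but as I understand it, training works better with the CLI train method. However, my data is in the format above.
I found snippets for this kind of conversion, but every one of them calls spacy.load('en') rather than starting from a blank model. That made me wonder: are they training an existing model rather than a blank one?
This snippet looks pretty simple: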
import spacy
from spacy.gold import docs_to_json
import srsly
nlp = spacy.load('en', disable=["ner"])  # as you see it's loading 'en' which I don't have
# TRAIN_DATA is the list shown above
docs = []
for text, annot in TRAIN_DATA:
    doc = nlp(text)
    doc.ents = [doc.char_span(start_idx, end_idx, label=label) for start_idx, end_idx, label in annot["entities"]]
    docs.append(doc)
srsly.write_json("ent_train_data.json", [docs_to_json(docs)])
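Running this code throws: Can't find model 'en'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.
I'm confused about how to use this with spacy train on a blank model. Do I just use spacy.blank('en')? And what about the disable=["ner"] flag?
Edit:
If I try spacy.blank('en'), I get: Can't import language goal from spacy.lang: No module named 'spacy.lang.en'
Edit 2:
I tried loading en_core_web_sm: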
nlp = spacy.load('en_core_web_sm')
docs = []
for text, annot in TRAIN_DATA:
    doc = nlp(text)
    doc.ents = [doc.char_span(start_idx, end_idx, label=label) for start_idx, end_idx, label in annot["entities"]]
    docs.append(doc)
srsly.write_json("ent_train_data.json", [docs_to_json(docs)])
This fails with:
TypeError: object of type 'NoneType' has no len()
Debugging the failing example:
print(text[start:end]) gives: Ailton
print(text) gives: Goal! FK Qarabag 1, Partizani Tirana 0. Filip Ozobic - FK Qarabag - shot with the head from the centre of the box to the centre of the goal. Assist - Ailton
The doc.ents = ... line gets None and raises:
TypeError: object of type 'NoneType' has no len()
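The None here most likely comes from doc.char_span, which returns None when the character offsets do not line up with token boundaries. Below is a minimal sketch of that behaviour, assuming a blank English pipeline is available. The example text and the off-by-one offset are made up for illustration and are not the actual annotation from the failing example.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only, no trained model required
doc = nlp("Assist - Ailton ")

# Offsets 9..15 cover exactly the token "Ailton".
aligned = doc.char_span(9, 15, label="PERSON")
# Offsets 9..16 also include the trailing space, so they end mid-gap
# rather than on a token boundary and char_span gives back None.
misaligned = doc.char_span(9, 16, label="PERSON")

print(aligned, misaligned)

# Dropping the None results avoids passing None into doc.ents.
doc.ents = [span for span in (aligned, misaligned) if span is not None]
print([(ent.text, ent.label_) for ent in doc.ents])
```

Silently skipping misaligned spans loses annotations, so it may be worth logging each skipped (text, start, end) triple and correcting the offsets in the source data.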
Edit 3: from Ines' comment
nlp = spacy.load('en_core_web_sm')
docs = []
for text, annot in TRAIN_DATA:
    doc = nlp(text)
    tags = biluo_tags_from_offsets(doc, annot['entities'])
    docs.append(doc)
srsly.write_json(train_name + "_spacy_format.json", [docs_to_json(docs)])
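This created the JSON, but I don't see any of my tagged entities in the generated file.
Edit 3 is nearly there, but you're missing the step that adds the entities to the document. This should work: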
import spacy
import srsly
from spacy.gold import docs_to_json, biluo_tags_from_offsets, spans_from_biluo_tags
TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]
nlp = spacy.load('en_core_web_sm')
docs = []
for text, annot in TRAIN_DATA:
    doc = nlp(text)
    # Convert character offsets to per-token BILUO tags
    tags = biluo_tags_from_offsets(doc, annot['entities'])
    # Turn the tags into Span objects and attach them to the doc
    entities = spans_from_biluo_tags(doc, tags)
    doc.ents = entities
    docs.append(doc)
srsly.write_json("spacy_format.json", [docs_to_json(docs)])
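For readers unfamiliar with the BILUO scheme, here is a simplified pure-Python sketch of what offset-to-tag conversion produces. It assumes a small regex tokenizer (`simple_tokens`, a hypothetical helper) instead of spaCy's tokenizer, so it is an illustration of the idea rather than spaCy's implementation. The point is that the tags are just a separate list: nothing is stored on the document until spans are built from them and assigned to doc.ents.

```python
import re


def simple_tokens(text):
    """Return (start, end) character offsets of word and punctuation tokens."""
    return [m.span() for m in re.finditer(r"\w+|[^\w\s]", text)]


def biluo_tags(text, entities):
    """Map (start, end, label) character offsets to one BILUO tag per token."""
    spans = simple_tokens(text)
    tags = ["O"] * len(spans)
    starts = {start: i for i, (start, _) in enumerate(spans)}
    ends = {end: i for i, (_, end) in enumerate(spans)}
    for start, end, label in entities:
        if start not in starts or end not in ends:
            continue  # offsets not on token boundaries; skipped in this sketch
        first, last = starts[start], ends[end]
        if first == last:
            tags[first] = f"U-{label}"  # single-token entity
        else:
            tags[first] = f"B-{label}"  # beginning
            for i in range(first + 1, last):
                tags[i] = f"I-{label}"  # inside
            tags[last] = f"L-{label}"  # last
    return tags


print(biluo_tags("Who is Shaka Khan?", [(7, 17, "PERSON")]))
# ['O', 'O', 'B-PERSON', 'L-PERSON', 'O']
print(biluo_tags("I like London and Berlin.", [(7, 13, "LOC"), (18, 24, "LOC")]))
# ['O', 'O', 'U-LOC', 'O', 'U-LOC', 'O']
```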
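It would be good to add a built-in function for this conversion, since it's common to want to move from the example scripts (which are only meant as simple demos) to the train CLI.
Edit:
You can also skip the indirection through the built-in BILUO converters and use what you had above: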
doc.ents = [doc.char_span(start_idx, end_idx, label=label) for start_idx, end_idx, label in annot["entities"]]
import spacy
import srsly
from spacy.training import docs_to_json, offsets_to_biluo_tags, biluo_tags_to_spans
TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]
nlp = spacy.load('en_core_web_lg')
docs = []
for text, annot in TRAIN_DATA:
    doc = nlp(text)
    # Convert character offsets to per-token BILUO tags
    tags = offsets_to_biluo_tags(doc, annot['entities'])
    # Turn the tags into Span objects and attach them to the doc
    entities = biluo_tags_to_spans(doc, tags)
    doc.ents = entities
    docs.append(doc)
srsly.write_json("spacy_format.json", [docs_to_json(docs)])
As of spaCy v3.1, the code above works. Some of the relevant methods in spacy.gold have been renamed and moved to spacy.training.
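One caveat for spaCy v3: the spacy train command reads binary .spacy files rather than JSON, and JSON written with docs_to_json can be turned into that format with spacy convert. As an alternative, here is a minimal sketch that writes the .spacy file directly with DocBin from a blank pipeline. The output file name train.spacy is an arbitrary choice, and misaligned spans are simply skipped with a message.

```python
import spacy
from spacy.tokens import DocBin

TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]

nlp = spacy.blank("en")  # tokenizer only, so no pretrained NER is involved
doc_bin = DocBin()

for text, annot in TRAIN_DATA:
    doc = nlp.make_doc(text)
    spans = []
    for start, end, label in annot["entities"]:
        span = doc.char_span(start, end, label=label)
        if span is None:
            # Offsets do not match token boundaries; report and skip.
            print(f"Skipping misaligned entity {text[start:end]!r} in {text!r}")
            continue
        spans.append(span)
    doc.ents = spans
    doc_bin.add(doc)

doc_bin.to_disk("train.spacy")
```

The resulting file can then be passed to training with something like python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy, where config.cfg and dev.spacy are assumed to exist already.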