Can I use the spaCy command line tools to train an NER model containing an additional entity type?

I am trying to train a spaCy model using only the python -m spacy train command line tool, without writing any code of my own.

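I have a training set of documents to which I have added OIL_COMPANY entity spans. I used gold.docs_to_json to create training files in the JSON-serializable format.

I can train starting from a blank model. However, if I try to extend the existing en_core_web_lg model, I see the following error.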

KeyError: "[E022] Could not find a transition with the name 'B-OIL_COMPANY' in the NER model."

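So I need to be able to tell the command line tool to add OIL_COMPANY to the existing list of NER labels. The discussion in Training an additional entity type shows how to do this in code by calling add_label on the NER pipe, but I don't see any command line option that does this.

Is it possible to extend an existing NER model to new entities using only the command line training tool, or do I have to write code?

For the CLI in spaCy, see this link.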

Train a model. Expects data in spaCy’s JSON format. On each epoch, a model will be saved out to the directory. Accuracy scores and model details will be added to a meta.json to allow packaging the model using the package command.

python -m spacy train [lang] [output_path] [train_path] [dev_path]
[--base-model] [--pipeline] [--vectors] [--n-iter] [--n-early-stopping]
[--n-examples] [--use-gpu] [--version] [--meta-path] [--init-tok2vec]
[--parser-multitasks] [--entity-multitasks] [--gold-preproc] [--noise-level]
[--orth-variant-level] [--learn-tokens] [--textcat-arch] [--textcat-multilabel]
[--textcat-positive-label] [--verbose]
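As a rough illustration of how these options combine, the sketch below assembles a train invocation that points --base-model at a saved model directory. The helper name build_train_command and all of the paths are hypothetical; only the positional arguments and flags from the usage listing above are used.

```python
import sys


def build_train_command(lang, output_path, train_path, dev_path,
                        base_model=None, pipeline=None):
    """Assemble the argument list for `python -m spacy train`.

    Hypothetical helper: it only builds the list, it does not run anything.
    """
    cmd = [sys.executable, "-m", "spacy", "train",
           lang, output_path, train_path, dev_path]
    if base_model is not None:
        # Directory (or package name) of the model to update.
        cmd += ["--base-model", base_model]
    if pipeline is not None:
        # Comma-separated names of the components to train, e.g. "ner".
        cmd += ["--pipeline", pipeline]
    return cmd


# Example paths are placeholders, not files from the original post.
cmd = build_train_command("en", "./output", "./train.json", "./dev.json",
                          base_model="./base-model", pipeline="ner")
print(" ".join(cmd))
# To actually launch training: import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems if any of the paths contain spaces.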

Ines answered this for me on the Prodigy support forum.

I think what's happening here is that the spacy train command expects the base model you want to update to already have all labels added that you want to train. (It processes the data as a stream, so it's not going to compile all labels upfront and silently add them on the fly.) So if you want to update an existing pretrained model and add a new label, you should be able to just add the label and save out the base model:

ner = nlp.get_pipe("ner")
ner.add_label("YOUR_LABEL")
nlp.to_disk("./base-model")

This isn't entirely without writing code, but it's pretty close.
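To avoid typing each new label by hand, the label-adding step could be driven by the training data itself. The following is a minimal sketch, not part of Ines's answer. It assumes the training file follows spaCy v2's JSON layout (a list of documents with paragraphs, sentences, and tokens, where each token may carry a BILUO tag in its "ner" field). The helper names collect_ner_labels and add_missing_labels, and the sample data, are hypothetical.

```python
def collect_ner_labels(train_docs):
    """Return the set of entity labels used in spaCy-v2-style JSON training docs."""
    labels = set()
    for doc in train_docs:
        for paragraph in doc.get("paragraphs", []):
            for sentence in paragraph.get("sentences", []):
                for token in sentence.get("tokens", []):
                    tag = token.get("ner") or "O"
                    # BILUO tags look like "B-OIL_COMPANY"; "O" and "-" carry no label.
                    _, sep, label = tag.partition("-")
                    if sep and label:
                        labels.add(label)
    return labels


def add_missing_labels(base_model, output_dir, labels):
    """Add labels the base model's NER lacks, then save it to output_dir."""
    import spacy  # imported lazily so the helper above works without spaCy

    nlp = spacy.load(base_model)
    ner = nlp.get_pipe("ner")
    missing = sorted(set(labels) - set(ner.labels))
    for label in missing:
        ner.add_label(label)
    nlp.to_disk(output_dir)
    return missing


# Tiny made-up document in the assumed JSON layout.
sample_docs = [{
    "id": 0,
    "paragraphs": [{
        "sentences": [{
            "tokens": [
                {"id": 0, "orth": "Acme", "ner": "B-OIL_COMPANY"},
                {"id": 1, "orth": "Petroleum", "ner": "L-OIL_COMPANY"},
                {"id": 2, "orth": "hired", "ner": "O"},
                {"id": 3, "orth": "Reuters", "ner": "U-ORG"},
            ]
        }]
    }]
}]

found_labels = collect_ner_labels(sample_docs)
print(sorted(found_labels))
# With spaCy and en_core_web_lg installed, the base model could then be prepared with:
# add_missing_labels("en_core_web_lg", "./base-model", found_labels)
```

The saved ./base-model directory is then what gets passed to the train command as the base model.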