Stanford CoreNLP: pause and continue the annotation pipeline

Question

Normally, when you use the CoreNLP annotation pipeline for, say, NER, you write the following code:

Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, ner");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
pipeline.annotate(document);

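I want to perform the sentence splitting step, i.e. ssplit, in the pipeline above. But before continuing with the rest of the pipeline, I want to remove sentences that are too long. What I have been doing is running sentence splitting, filtering the sentences by length, and then performing NER by applying the whole pipeline, i.e. tokenize, ssplit, pos, lemma, ner. So basically I run tokenize and ssplit twice. Is there a more efficient way to do this? For example, run tokenize and ssplit, pause the pipeline to remove the overly long sentences, and then resume the pipeline with pos, lemma, and ner.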

Answer 1

You can create two pipeline objects, with the second one using the later annotators. So:

Properties props = new Properties();
props.put("annotators", "tokenize, ssplit");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
pipeline.annotate(document);

Followed by:

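// Renamed so this snippet can share a scope with the first one.
Properties nerProps = new Properties();
nerProps.put("annotators", "pos, lemma, ner");
// false = do not enforce annotator requirements; tokenize and ssplit have already run.
StanfordCoreNLP nerPipeline = new StanfordCoreNLP(nerProps, false);
nerPipeline.annotate(document);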

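Of course, note that if you remove sentences from the middle, some annotations (e.g. character offsets) will be unintuitive, since they still refer to positions in the original text.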
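The length filter that runs between the two pipelines is not spelled out in this answer. A minimal sketch of that step, using only the JDK, might look like the following. The class `SentenceFilter`, the helper `keepShort`, and the threshold of 5 are hypothetical names and values chosen for illustration, and plain whitespace-split strings stand in for CoreNLP sentence objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.ToIntFunction;

class SentenceFilter {

    // Hypothetical helper: keeps only the sentences whose length
    // (as measured by the supplied function) is at most maxLength.
    static <T> List<T> keepShort(List<T> sentences, ToIntFunction<T> length, int maxLength) {
        List<T> kept = new ArrayList<>();
        for (T sentence : sentences) {
            if (length.applyAsInt(sentence) <= maxLength) {
                kept.add(sentence);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Stand-in data: each string is a pre-tokenized sentence.
        List<String> sentences = Arrays.asList(
                "A short one .",
                "This one has far too many tokens to keep .",
                "Fine .");
        // Length here is the number of whitespace-separated tokens.
        List<String> kept = keepShort(sentences, s -> s.split(" ").length, 5);
        System.out.println(kept);
    }
}
```

With CoreNLP, the input list would presumably come from `document.get(CoreAnnotations.SentencesAnnotation.class)` after the first pipeline has run, the length function could be `s -> s.get(CoreAnnotations.TokensAnnotation.class).size()`, and the filtered list would be written back with `document.set(CoreAnnotations.SentencesAnnotation.class, kept)` before the second pipeline is called. That wiring is an assumption rather than something the answer confirms, and the document-level token list would still contain the tokens of the removed sentences.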

java

stanford-nlp