CRFClassifier doesn't recognize sentence splitter options

Question

I am using CoreNLP to annotate named entities (NEs) in multi-line English text. When I run the following:

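import java.util.List;
import java.util.Properties;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

Properties props = new Properties();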
props.put("annotators", "tokenize, ssplit, pos, lemma, ner");
props.put("ssplit.newlineIsSentenceBreak", "always");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
String contentStr = "John speaks with Martin\n\nJeremy talks to him too.";
Annotation document = new Annotation(contentStr);
pipeline.annotate(document);
List<CoreMap> sents = document.get(SentencesAnnotation.class);
for (int i = 0; i < sents.size(); i++) {
    System.out.println("sentence " + i + " "+ sents.get(i));
}

Sentence splitting works well and recognizes two sentences. However, when I run the NER classifier as follows:

CRFClassifier classifier = CRFClassifier.getClassifier("edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz", props);
String classifiedStr = classifier.classifyWithInlineXML(contentStr);

I get the following error messages:

Unknown property: |ssplit.newlineIsSentenceBreak|
Unknown property: |annotators|

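The classifier also seems to treat the whole text as a single sentence, so it wrongly recognizes one entity, "Martin Jeremy", instead of two distinct entities.

Any idea what is going wrong?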

Answer 1

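The properties that CRFClassifier.getClassifier accepts are different from the ones the StanfordCoreNLP constructor accepts, which is why you get the unknown property errors. The pipeline options still get set on the Properties object, but the classifier does not use them at runtime.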
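One way to avoid handing pipeline-only keys to the classifier is to build a second Properties object that holds just the keys meant for it. The sketch below uses only java.util; the class PropertySeparation and the method keepOnly are hypothetical helpers written for illustration, not part of CoreNLP.

```java
import java.util.Collections;
import java.util.Properties;
import java.util.Set;

// Hypothetical helper (not a CoreNLP class): keeps pipeline options and
// classifier options in separate Properties objects.
class PropertySeparation {

    // Returns a new Properties object containing only the keys in allowedKeys.
    static Properties keepOnly(Properties source, Set<String> allowedKeys) {
        Properties result = new Properties();
        for (String key : source.stringPropertyNames()) {
            if (allowedKeys.contains(key)) {
                result.setProperty(key, source.getProperty(key));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Properties all = new Properties();
        // Pipeline options taken from the question.
        all.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner");
        all.setProperty("ssplit.newlineIsSentenceBreak", "always");
        // Classifier option taken from the answer.
        all.setProperty("tokenizerOptions", "tokenizeNLs=true");

        // Only this subset would be passed to CRFClassifier.getClassifier.
        Properties classifierProps =
                keepOnly(all, Collections.singleton("tokenizerOptions"));
        System.out.println(classifierProps.stringPropertyNames());
    }
}
```

The original Properties object can still be passed to the StanfordCoreNLP constructor unchanged, while the filtered copy goes to the classifier.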

The properties you need to set are those of SeqClassifierFlags. Set tokenizerOptions to "tokenizeNLs=true", which makes the tokenizer treat newlines as tokens.

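Bottom line: set the properties as follows before getting the classifier. You should no longer get the unknown property errors, and it should work as expected.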

Properties props = new Properties();
props.put("tokenizerOptions", "tokenizeNLs=true");

CRFClassifier classifier = CRFClassifier.getClassifier("edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz", props);
String classifiedStr = classifier.classifyWithInlineXML(contentStr);
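If the tokenizer option does not give the desired split, a different workaround (not part of the answer above) is to split the input on line breaks yourself and classify each line separately, so an entity can never span two lines. This is a minimal sketch under that assumption: LineByLineClassifier and its methods are hypothetical names, and the classifier is passed in as a plain Function<String, String> so the example runs without CoreNLP on the classpath.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical helper (not a CoreNLP class): classifies each non-blank line
// of the input on its own.
class LineByLineClassifier {

    // Splits text on any line break and drops blank lines.
    static List<String> splitOnNewlines(String text) {
        List<String> segments = new ArrayList<>();
        for (String line : text.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                segments.add(trimmed);
            }
        }
        return segments;
    }

    // Applies classifyFn to every segment and joins the results with "\n".
    static String classifyPerLine(String text, Function<String, String> classifyFn) {
        List<String> classified = new ArrayList<>();
        for (String segment : splitOnNewlines(text)) {
            classified.add(classifyFn.apply(segment));
        }
        return String.join("\n", classified);
    }

    public static void main(String[] args) {
        String contentStr = "John speaks with Martin\n\nJeremy talks to him too.";
        // Stand-in for a real classifier, used only to show the call shape.
        Function<String, String> standIn = s -> "[" + s + "]";
        System.out.println(classifyPerLine(contentStr, standIn));
    }
}
```

With the classifier from the snippet above, the stand-in would be replaced by classifier::classifyWithInlineXML. The trade-off is that a sentence wrapped across several lines is also cut at each line break.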

java

nlp

stanford-nlp