StanfordNLP to detect compound entities with prepositions

Question

Basically, in the sentence:

<Lord of the bracelets> is a fantasy movie.

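I would like to detect the compound Lord of the bracelets as a single entity (one that could also be linked by the entitylink annotator). That means detecting structures whose POS tags have a form like NNP DT NNP or NN IN DT NNP.

Is this possible with CoreNLP?

My current setup doesn't detect them, and I couldn't find a way to do it.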


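  import java.util.Properties;

  import edu.stanford.nlp.pipeline.CoreDocument;
  import edu.stanford.nlp.pipeline.StanfordCoreNLP;

  public class NamedEntityRecognition {
    private final StanfordCoreNLP pipeline;

    public NamedEntityRecognition() {
      Properties props = new Properties();
      props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,entitylink");
      props.setProperty("tokenize.options", "untokenizable=noneDelete");

      pipeline = new StanfordCoreNLP(props);
    }

    // Runs the whole pipeline on the text and returns the annotated document.
    public CoreDocument recogniseEntities(String text) {
      CoreDocument doc = new CoreDocument(text);
      pipeline.annotate(doc);
      return doc;
    }
  }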

Thanks!

Answer 1

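You can do this with TokensRegex (maybe with RegexNER too, but I don't think so). In the rules you can specify that certain part-of-speech tag patterns should be marked as entities.

A full description of TokensRegex is available here: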

https://stanfordnlp.github.io/CoreNLP/tokensregex.html
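
To make the idea concrete, here is a minimal plain-Java sketch of what "marking a POS tag pattern as an entity" amounts to. It does not use CoreNLP at all: `PosPatternSketch`, `findMatches` and `labelMatches` are hypothetical names, and the tags are written by hand, so a real tagger may well assign different ones.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: a stand-in for what a tag-sequence rule does, not CoreNLP code.
class PosPatternSketch {

    /** Returns the start index of every window whose tags equal the pattern exactly. */
    public static List<Integer> findMatches(List<String> tags, List<String> pattern) {
        List<Integer> starts = new ArrayList<>();
        for (int i = 0; i + pattern.size() <= tags.size(); i++) {
            if (tags.subList(i, i + pattern.size()).equals(pattern)) {
                starts.add(i);
            }
        }
        return starts;
    }

    /** Copies the NER labels and sets `label` on every token covered by a match. */
    public static List<String> labelMatches(List<String> tags, List<String> nerLabels,
                                            List<String> pattern, String label) {
        List<String> out = new ArrayList<>(nerLabels);
        for (int start : findMatches(tags, pattern)) {
            for (int j = start; j < start + pattern.size(); j++) {
                out.set(j, label); // any earlier label on this token is overwritten
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hand-written example data (an assumption, not tagger output).
        List<String> words = Arrays.asList("The", "lord", "of", "the", "Shire", "arrived");
        List<String> tags = Arrays.asList("DT", "NN", "IN", "DT", "NNP", "VBD");
        List<String> ner = Arrays.asList("O", "O", "O", "O", "LOCATION", "O");
        List<String> pattern = Arrays.asList("NN", "IN", "DT", "NNP");

        List<String> labeled = labelMatches(tags, ner, pattern, "COMPOUND");
        for (int i = 0; i < words.size(); i++) {
            System.out.println(words.get(i) + "\t" + tags.get(i) + "\t" + labeled.get(i));
        }
    }
}
```

With this hand-made input the pattern matches once, starting at "lord", and all four tokens end up with the same label. TokensRegex does the matching for you and supports much richer patterns than this exact-sequence comparison.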

Answer 2

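While @StanfordNLPHelp's answer was helpful, I thought I'd add some more detail on my final solution.

Option 1:

Add a TokensRegex annotator, as pointed out in the previous answer. This adds a more customizable annotator to the pipeline, and you can specify your own rules in a text file.

This is what my rules file (extended_ner.rules) looks like: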

# these Java classes will be used by the rules
ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }
tokens = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$TokensAnnotation" }

# rule for recognizing compound names
{ ruleType: "tokens", pattern: ([{tag:"NN"}] [{tag:"IN"}] [{tag:"DT"}] [{tag:"NNP"}]), action: Annotate($0, ner, "COMPOUND"), result: "COMPOUND_RESULT" }

You can see a breakdown of the rule syntax here.
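
The question mentions two tag shapes (NNP DT NNP and NN IN DT NNP), while this rules file only covers one of them. As a rough sketch, a small helper can generate one rule line per shape in the same format. `RuleFileSketch` and `ruleFor` are hypothetical names of my own, and whether a given shape actually fires depends on the tags the pos annotator assigns.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Illustration only: builds rule text in the same format as extended_ner.rules.
class RuleFileSketch {

    /** Builds one "tokens" rule that labels a POS-tag sequence with the given NER type. */
    public static String ruleFor(List<String> tags, String nerType) {
        // One [{tag:"..."}] token pattern per tag, separated by spaces.
        String pattern = tags.stream()
                .map(tag -> "[{tag:\"" + tag + "\"}]")
                .collect(Collectors.joining(" "));
        // $0 refers to the whole matched token sequence.
        return "{ ruleType: \"tokens\", pattern: (" + pattern + "), "
                + "action: Annotate($0, ner, \"" + nerType + "\"), "
                + "result: \"" + nerType + "_RESULT\" }";
    }

    public static void main(String[] args) {
        // The two shapes from the question; print them to paste into the rules file.
        System.out.println(ruleFor(Arrays.asList("NN", "IN", "DT", "NNP"), "COMPOUND"));
        System.out.println(ruleFor(Arrays.asList("NNP", "DT", "NNP"), "COMPOUND"));
    }
}
```

The printed lines go below the `ner = ...` and `tokens = ...` definitions, which the rules still need.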

NOTE: The TokensRegex annotator must be added after the ner annotator. Otherwise, the results will be overwritten.

This is what the Java code looks like:

  public NamedEntityRecognition() {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,tokensregex,entitylink");
    props.setProperty("tokensregex.rules", "extended_ner.rules");
    props.setProperty("tokenize.options", "untokenizable=noneDelete");

    pipeline = new StanfordCoreNLP(props);
  }

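Option 2 (the one I chose)

Instead of adding another annotator, the rules file can be passed to the ner annotator through the "ner.additional.tokensregex.rules" property. Here is the documentation.

I chose this option because it seems simpler, and adding another annotator to the pipeline seemed a bit excessive to me.

The rules file is exactly the same as in option 1, and the Java code is now: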

  public NamedEntityRecognition() {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,entitylink");
    props.setProperty("ner.additional.tokensregex.rules", "extended_ner.rules");
    props.setProperty("tokenize.options", "untokenizable=noneDelete");

    pipeline = new StanfordCoreNLP(props);
  }

NOTE: For this to work, the property "ner.applyFineGrained" must be true (the default value).
