Iterate through tokens and find the entity for a token

Question

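Problem

After running CoreNLP on some text, I want to reconstruct each sentence, attaching the POS tag to every token and grouping the tokens that together form an entity.

This would be easy if there were a way to look up which entity a given token belongs to.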

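Approach

One option I am considering is to walk through sentence.tokens() and look each token up in a list that holds only the tokens of all the CoreEntityMentions in that sentence. That tells me which CoreEntityMention the token belongs to, so I can group the tokens accordingly.

Another option is to take the character offsets of each token in the sentence and compare them with the offsets of each CoreEntityMention.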
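The offset comparison described above can be sketched in plain Java. This is only an illustration under stated assumptions: the class name `OffsetMatcher` and its method are hypothetical, and the `int[]` pairs stand in for spans that would presumably come from `token.beginPosition()` / `token.endPosition()` and from `CoreEntityMention.charOffsets()`. The offsets in `main` are written by hand, not produced by CoreNLP.

```java
import java.util.Arrays;

// Hypothetical helper, not part of CoreNLP.
class OffsetMatcher {
    /**
     * For each token span [begin, end), returns the index of the first mention
     * span that fully contains it, or -1 if no mention contains the token.
     */
    static int[] mentionForTokens(int[][] tokenSpans, int[][] mentionSpans) {
        int[] owner = new int[tokenSpans.length];
        Arrays.fill(owner, -1);
        for (int t = 0; t < tokenSpans.length; t++) {
            for (int m = 0; m < mentionSpans.length; m++) {
                boolean inside = tokenSpans[t][0] >= mentionSpans[m][0]
                        && tokenSpans[t][1] <= mentionSpans[m][1];
                if (inside) {
                    owner[t] = m;
                    break;
                }
            }
        }
        return owner;
    }

    public static void main(String[] args) {
        // Hand-written character offsets for "Joe Smith lives in Hawaii".
        int[][] tokens = {{0, 3}, {4, 9}, {10, 15}, {16, 18}, {19, 25}};
        // Assumed mentions: "Joe Smith" and "Hawaii".
        int[][] mentions = {{0, 9}, {19, 25}};
        // prints [0, 0, -1, -1, 1]
        System.out.println(Arrays.toString(mentionForTokens(tokens, mentions)));
    }
}
```

The nested loop is quadratic in the worst case, which should be fine for a single sentence. Since tokens and mentions both appear in text order, a single merged pass would also work for longer inputs.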

I think this question is similar to the one here, but that was asked a while ago, so the API may have changed since.

Here is the setup:

    import edu.stanford.nlp.pipeline.*;
    import java.util.List;
    import java.util.Properties;

    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");

    // build the pipeline with the annotators listed above
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    String text = "Some text with entities goes here";
    CoreDocument coreDoc = new CoreDocument(text);
    // annotate the document
    pipeline.annotate(coreDoc);
    for (CoreSentence sentence : coreDoc.sentences()) {
      // Code goes here
      List<CoreEntityMention> em = sentence.entityMentions();
    }

Answer 1

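Every token that is part of an entity mention carries an index, and that index identifies the corresponding entity mention in the document.

    // cl is the CoreLabel for a token
    cl.get(CoreAnnotations.EntityMentionIndexAnnotation.class);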
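Once each token has its mention index, the grouping asked about in the question comes down to merging consecutive tokens that share the same index. Below is a minimal sketch of that step in plain Java, with no CoreNLP dependency. The class `MentionGrouper` is hypothetical, and the three parallel arrays stand in for `token.word()`, `token.tag()`, and the result of the `get` call above. It assumes that tokens outside any mention have no index set, so the lookup returns null for them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of CoreNLP.
class MentionGrouper {
    /**
     * Groups consecutive tokens that share the same non-null mention index.
     * Tokens with a null index (assumed: not part of any mention) each form
     * their own single-token group. Every entry is rendered as "word/POS".
     */
    static List<List<String>> group(String[] words, String[] tags, Integer[] mentionIndex) {
        List<List<String>> groups = new ArrayList<>();
        Integer previous = null;
        for (int i = 0; i < words.length; i++) {
            Integer current = mentionIndex[i];
            boolean sameMention = current != null && current.equals(previous);
            if (!sameMention) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(words[i] + "/" + tags[i]);
            previous = current;
        }
        return groups;
    }

    public static void main(String[] args) {
        // Hand-written sample values, not actual CoreNLP output.
        String[] words = {"Joe", "Smith", "lives", "in", "Hawaii"};
        String[] tags = {"NNP", "NNP", "VBZ", "IN", "NNP"};
        Integer[] mentionIndex = {0, 0, null, null, 1};
        // prints [[Joe/NNP, Smith/NNP], [lives/VBZ], [in/IN], [Hawaii/NNP]]
        System.out.println(group(words, tags, mentionIndex));
    }
}
```

With real pipeline output, the same index could presumably be used to fetch the matching CoreEntityMention from the document's mention list, for example to read its entity type. It is worth checking that against the CoreNLP version in use.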

I will make a note to add a convenience method for this in a future release.

stanford-nlp