How to resolve coreferences for a specific noun phrase in Stanford CoreNLP

Question

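I have a number of fairly large text files, and in each one I want to resolve the references to a specific noun phrase, e.g. 'Harry Potter'.

I don't want to run full coreference resolution over every possible mention, because that would take too long.

Many thanks!

Here is what I have so far...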

import edu.stanford.nlp.io.*;
import edu.stanford.nlp.pipeline.*;

import java.io.*;
import java.util.Properties;

public class Main {

    public static void main(String[] args) throws IOException {
        // Set input and output files
        String inputFilename = "weblink_text.txt";
        String fileContents = IOUtils.slurpFileNoExceptions(inputFilename);

        // Set properties: the annotators run in the order listed
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, cleanxml, ssplit, pos, lemma, ner, parse, dcoref");

        // Annotate the text
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        Annotation annotation = new Annotation(fileContents);
        pipeline.annotate(annotation);

        // Write the result as XML; try-with-resources closes the stream even if printing fails
        try (FileOutputStream xmlOut = new FileOutputStream(new File("nlp.xml"))) {
            pipeline.xmlPrint(annotation, xmlOut);
        }

        System.out.println("Completed");
    }
}

Answer 1

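Unfortunately, a full coreference analysis is needed to get any usable annotation for a specific noun phrase. (Hard anaphora such as pronouns cannot be resolved without the full analysis.)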

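The best I can recommend is to process the text in small chunks that are fairly "independent" of each other in terms of coreference (e.g., the chapters of a book).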
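A minimal sketch of the chunking idea, using only the Java standard library. The class name `ChapterChunker` and the assumption that every chapter starts on a line beginning with "Chapter " are hypothetical; adjust the pattern to whatever headings your files actually use. Each returned chunk could then be wrapped in its own `Annotation` and passed to `pipeline.annotate` separately.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class ChapterChunker {
    // Assumption: a chapter begins on a line that starts with "Chapter " plus a number or word.
    // The lookahead makes the match zero-width, so the heading stays inside its chunk.
    private static final Pattern CHAPTER_START =
            Pattern.compile("(?m)^(?=Chapter\\s+\\S+)");

    // Splits the text at every chapter heading; text before the first heading becomes its own chunk.
    static List<String> splitIntoChapters(String text) {
        List<String> chunks = new ArrayList<>();
        for (String piece : CHAPTER_START.split(text)) {
            String trimmed = piece.trim();
            if (!trimmed.isEmpty()) {
                chunks.add(trimmed);
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        String book = "Chapter 1\nHarry woke up. He was late.\n"
                + "Chapter 2\nRon arrived. He smiled.\n";
        for (String chapter : splitIntoChapters(book)) {
            // In the real program, annotate each chapter on its own here.
            System.out.println("--- chunk of " + chapter.length() + " characters ---");
            System.out.println(chapter);
        }
    }
}
```

Coreference chains that cross a chunk boundary are lost with this approach, which is why the chunks should be as self-contained as possible.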

Answer 2

1) If you only care about resolving pronouns to their antecedents, I recommend looking at David Bamman's book-nlp.

It does very fast coref on novel-length texts, but only for pronouns and their antecedents (which may be what you're most interested in anyway).

You can then read the .tokens file and build your own coreference graph.

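2) If you really need coref beyond that, try setting the dcoref.maxdist parameter to stop it from going all the way back from chapter 20 to chapter 1 looking for material, for example. I would then save some version of the annotated text (e.g. serialized) to load later, so you don't have to keep re-running this.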
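A small sketch of setting that parameter, reusing the annotator list from the question. The class name `CorefConfig` and the value 50 are illustrative assumptions, not recommendations; my understanding is that dcoref.maxdist is measured in sentences, so check the dcoref documentation for your CoreNLP version before relying on it.

```java
import java.util.Properties;

class CorefConfig {
    // Builds pipeline properties with a cap on how far back dcoref searches for antecedents.
    // Assumption: the value is a sentence distance; 50 below is only an example.
    static Properties buildProps(int maxSentenceDistance) {
        Properties props = new Properties();
        props.setProperty("annotators",
                "tokenize, cleanxml, ssplit, pos, lemma, ner, parse, dcoref");
        props.setProperty("dcoref.maxdist", Integer.toString(maxSentenceDistance));
        return props;
    }

    public static void main(String[] args) {
        Properties props = buildProps(50);
        // In the real program: new StanfordCoreNLP(props)
        System.out.println(props.getProperty("dcoref.maxdist"));
    }
}
```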

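[Edit] 3) In the near future there will be a new coref system (hcoref) in the Stanford CoreNLP build (https://github.com/stanfordnlp/CoreNLP/tree/master/src/edu/stanford/nlp/hcoref), which works off depparse and is a lot faster. I have been running it on 100 to 500 sentence chunks of entire novels, and it has worked well for me. (There is no equivalent of dcoref.maxdist in hcoref yet.)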
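One way to cut a novel into fixed-size sentence blocks like this is sketched below. It uses `java.text.BreakIterator` as a stand-in sentence splitter, which is an assumption on my part: its boundaries will not always match CoreNLP's own ssplit, and the class name `SentenceBlockChunker` is hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SentenceBlockChunker {
    // Groups consecutive sentences into chunks of at most sentencesPerChunk sentences.
    static List<String> chunkBySentences(String text, int sentencesPerChunk) {
        if (sentencesPerChunk < 1) {
            throw new IllegalArgumentException("sentencesPerChunk must be at least 1");
        }
        BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        sentences.setText(text);

        List<String> chunks = new ArrayList<>();
        int chunkStart = sentences.first();
        int count = 0;
        for (int end = sentences.next(); end != BreakIterator.DONE; end = sentences.next()) {
            count++;
            if (count == sentencesPerChunk) {
                addIfNotBlank(chunks, text.substring(chunkStart, end));
                chunkStart = end;
                count = 0;
            }
        }
        // Whatever is left over forms a final, shorter chunk.
        addIfNotBlank(chunks, text.substring(chunkStart));
        return chunks;
    }

    private static void addIfNotBlank(List<String> chunks, String piece) {
        String trimmed = piece.trim();
        if (!trimmed.isEmpty()) {
            chunks.add(trimmed);
        }
    }

    public static void main(String[] args) {
        String text = "Harry ran. He fell. Ron laughed.";
        for (String chunk : chunkBySentences(text, 2)) {
            System.out.println("[" + chunk + "]");
        }
    }
}
```

For a real novel the second argument would be somewhere in the 100 to 500 range mentioned above, with each chunk annotated separately.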

Also note: if parsing time is very expensive as well, try setting parse.maxlen.
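As a sketch, parse.maxlen can be added on top of whatever properties you already have. The helper name `withMaxSentenceLength` and the limit of 80 tokens are illustrative assumptions. As far as I know, sentences longer than the limit are not fully parsed and get a flat placeholder structure instead, so choose the limit with that trade-off in mind.

```java
import java.util.Properties;

class ParserLimits {
    // Returns a copy of the given properties with a token-length cap for the parser.
    static Properties withMaxSentenceLength(Properties base, int maxTokens) {
        Properties props = new Properties();
        props.putAll(base);
        props.setProperty("parse.maxlen", Integer.toString(maxTokens));
        return props;
    }

    public static void main(String[] args) {
        Properties base = new Properties();
        base.setProperty("annotators",
                "tokenize, cleanxml, ssplit, pos, lemma, ner, parse, dcoref");
        // Assumption: 80 tokens is only an example value.
        Properties limited = withMaxSentenceLength(base, 80);
        System.out.println(limited.getProperty("parse.maxlen"));
    }
}
```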

stanford-nlp