CoreNLP CoNLL format with CollapsedCCProcessedDependenciesAnnotation

I am using the latest version of CoreNLP.

My task is to parse text and get output in CoNLL format using CollapsedCCProcessedDependenciesAnnotation.

I run the following command:

time java -cp $CoreNLP/javanlp-core.jar edu.stanford.nlp.pipeline.StanfordCoreNLP -props $CoreNLP/config.properties -file 12309959  -outputFormat conll


My config.properties contains:

depparse.model = english_SD.gz

The problem is how to get CollapsedCCProcessedDependenciesAnnotation.

I tried setting depparse.extradependencies in config.properties,

but there is no value for CC-processed dependencies, according to http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/trees/GrammaticalStructure.Extras.html#REF_ONLY_COLLAPSED

Can you suggest a way to get CollapsedCCProcessedDependenciesAnnotation parses in CoNLL format?

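You can retrieve the CC-processed dependencies programmatically.

This question should be a good example (see the code in the example that uses CollapsedCCProcessedDependenciesAnnotation).


Gabor's answer on the mailing list explains this behavior well (i.e., why you cannot output the collapsed dependencies directly):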

Note that in general the collapsed cc processed dependencies won't output losslessly to conll though, as the format expects a tree (every word has a unique parent), and the dependencies can have multiple heads.

The output formatter therefore uses the basic dependencies only: https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/pipeline/CoNLLOutputter.java#L118. This could be changed in the code without crashing anything, but the serialized trees would be missing some edges, and ties for which edges are included would be broken somewhat arbitrarily. You may be better off writing your own logic for dumping to conll to fit your particular use case (you can probably copy much of our conll outputter code from above).
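
To make the multiple-heads problem concrete, here is a minimal sketch in plain Java that does not use CoreNLP at all. The class `CollapsedToConll`, its `Edge` type, the four-column output (index, word, head, relation) and the tie-breaking rule (keep the edge whose governor has the smallest index) are all assumptions made for illustration. They are not CoreNLP's API or its actual CoNLL column layout. In a real dumper you would fill the edge list from the SemanticGraph you retrieve for each sentence and choose a tie-breaking rule that suits your use case.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of CoreNLP: flattens a dependency graph
// in which a token may have several heads into one head per token.
final class CollapsedToConll {

    /** One dependency edge. Token indices are 1-based; 0 stands for ROOT. */
    static final class Edge {
        final int governor;
        final int dependent;
        final String relation;

        Edge(int governor, int dependent, String relation) {
            this.governor = governor;
            this.dependent = dependent;
            this.relation = relation;
        }
    }

    /**
     * Keeps exactly one incoming edge per dependent token.
     * Tie-breaking policy (an assumption of this sketch): the edge whose
     * governor has the smallest index wins; all other edges are dropped.
     */
    static Map<Integer, Edge> pickSingleHeads(List<Edge> edges) {
        Map<Integer, Edge> heads = new TreeMap<>();
        for (Edge e : edges) {
            Edge current = heads.get(e.dependent);
            if (current == null || e.governor < current.governor) {
                heads.put(e.dependent, e);
            }
        }
        return heads;
    }

    /** Emits simplified tab-separated lines: index, word, head, relation. */
    static List<String> toConll(String[] words, List<Edge> edges) {
        Map<Integer, Edge> heads = pickSingleHeads(edges);
        List<String> lines = new ArrayList<>();
        for (int i = 1; i <= words.length; i++) {
            Edge e = heads.get(i);
            // Tokens without any incoming edge (e.g. a collapsed "and") get "_".
            String head = (e == null) ? "_" : String.valueOf(e.governor);
            String rel = (e == null) ? "_" : e.relation;
            lines.add(i + "\t" + words[i - 1] + "\t" + head + "\t" + rel);
        }
        return lines;
    }

    public static void main(String[] args) {
        // "Bill and Mary sing" with collapsed, CC-processed style edges:
        // "Mary" has two heads (conj_and from "Bill", nsubj from "sing").
        String[] words = {"Bill", "and", "Mary", "sing"};
        List<Edge> edges = Arrays.asList(
                new Edge(4, 1, "nsubj"),
                new Edge(1, 3, "conj_and"),
                new Edge(4, 3, "nsubj"),
                new Edge(0, 4, "root"));
        for (String line : toConll(words, edges)) {
            System.out.println(line);
        }
    }
}
```

With this policy, "Mary" keeps only its conj_and edge to "Bill" and the nsubj edge to "sing" is dropped, while "and" ends up with no head at all. That is exactly the kind of loss and arbitrary tie-breaking the quoted answer warns about.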