Force Stanford CoreNLP Parser to Prioritize 'S' Label at Root Level

Question

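Greetings NLP experts,

I am using the Stanford CoreNLP package to produce constituency parses, with the latest version (3.9.2) of the English language models JAR, downloaded from the CoreNLP Download page. I access the parser through the Python interface in the NLTK module nltk.parse.corenlp. Here is a snippet from the top of my main module: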

import nltk
from nltk.tree import ParentedTree
from nltk.parse.corenlp import CoreNLPParser

parser = CoreNLPParser(url='http://localhost:9000')

I also start the server with the following (fairly generic) call from the terminal:

java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
    -annotators "parse" -port 9000 -timeout 30000

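The parser that CoreNLP selects by default (when the full English model is available) is the Shift-Reduce (SR) parser, which is sometimes claimed to be both more accurate and faster than the CoreNLP PCFG parser. Impressionistically, I can corroborate that from my own experience, in which I deal almost exclusively with Wikipedia text.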
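To remove any doubt about which parser is actually in play, the model can be named explicitly per request. What follows is only a minimal sketch. It assumes the standard model paths inside the English models JAR, and that the server honors `parse.model` when it is passed as a request property through NLTK's `properties` argument. The helper names `model_properties` and `parse_with_model` are hypothetical.

```python
# Model paths as shipped in the CoreNLP English models JAR (assumed unchanged).
SR_MODEL = "edu/stanford/nlp/models/srparser/englishSR.ser.gz"
PCFG_MODEL = "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz"


def model_properties(model_path):
    """Build per-request properties that select a constituency model."""
    return {"parse.model": model_path}


def parse_with_model(sentence, model_path, url="http://localhost:9000"):
    """Parse one sentence with the given model; needs a running server."""
    # Imported lazily so the helpers above work without NLTK installed.
    from nltk.parse.corenlp import CoreNLPParser

    parser = CoreNLPParser(url=url)
    # raw_parse returns an iterator of trees; take the first (best) parse.
    return next(parser.raw_parse(sentence, properties=model_properties(model_path)))


# Usage (requires the server started as shown above):
# sr_tree = parse_with_model("Coffee is a brewed drink.", SR_MODEL)
# pcfg_tree = parse_with_model("Coffee is a brewed drink.", PCFG_MODEL)
```

Parsing the same sentence with both models makes it easy to see whether a bad root label is specific to one parser.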

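However, I have noticed that the parser will often erroneously parse what is in fact a complete sentence (i.e., a finite matrix clause) as a subsentential constituent, commonly an NP. In other words, the parser should output an S label at root level, (ROOT (S ...)), but something in the complexity of the sentence's syntax pushes it to say that a sentence is not a sentence, e.g. (ROOT (NP ...)).

The parses for such problem sentences also always contain another (usually glaring) error further down the tree. Below are a few examples. I will paste only the top few levels of each tree to save space. Each is a perfectly acceptable English sentence, so the parses should all begin (ROOT (S ...)). In each case, however, some other label takes the place of S, and the rest of the tree is garbled.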

NP: An estimated 22–189 million school days are missed annually due to a cold. (ROOT (NP (NP An estimated 22) (: --) (S 189 million school days are missed annually due to a cold) (. .)))

FRAG: More than one-third of people who saw a doctor received an antibiotic prescription, which has implications for antibiotic resistance. (ROOT (FRAG (NP (NP More than one-third) (PP of people who saw a doctor received an antibiotic prescription, which has implications for antibiotic resistance)) (. .)))

UCP: Coffee is a brewed drink prepared from roasted coffee beans, the seeds of berries from certain Coffea species. (ROOT (UCP (S Coffee is a brewed drink prepared from roasted coffee beans) (, ,) (NP the seeds of berries from certain Coffea species) (. .)))
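Parses like these can be flagged automatically by looking at the label directly under ROOT. This is a rough sketch that works on bracketed strings with the standard library only. The helper names `top_label` and `non_sentence_parses` are hypothetical, and the third example string is made up to show a well-formed case.

```python
import re

# Matches the label of the first constituent directly under ROOT,
# e.g. "NP" in "(ROOT (NP ...))".
_TOP_LABEL = re.compile(r"\(\s*ROOT\s*\(\s*([^\s()]+)")


def top_label(bracketed):
    """Return the label of the node directly under ROOT, or None."""
    match = _TOP_LABEL.match(bracketed.strip())
    return match.group(1) if match else None


def non_sentence_parses(parses):
    """Return (label, parse) pairs whose top constituent is not S."""
    return [(top_label(p), p) for p in parses if top_label(p) != "S"]


examples = [
    "(ROOT (NP (NP An estimated 22) (: --) (S 189 million school days "
    "are missed annually due to a cold) (. .)))",
    "(ROOT (UCP (S Coffee is a brewed drink prepared from roasted coffee beans) "
    "(, ,) (NP the seeds of berries from certain Coffea species) (. .)))",
    # Hypothetical well-formed parse, included for contrast.
    "(ROOT (S (NP Coffee) (VP is a drink) (. .)))",
]

flagged = non_sentence_parses(examples)
print([label for label, _ in flagged])  # prints ['NP', 'UCP']
```

With the `Tree` objects that `CoreNLPParser.raw_parse` yields, the same check is `tree[0].label() != 'S'`, since the returned tree is rooted at ROOT.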

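Finally, here is my question, which I trust the above evidence shows to be a useful one: given that my data contains a negligible number of fragments or otherwise ill-formed sentences, how can I impose a high-level constraint on the CoreNLP parser so that its algorithm gives priority to assigning an S node directly below ROOT?

I am also curious whether imposing such a constraint when processing data (that one knows to satisfy it) would cure the myriad other ills observed in the resulting parses. From what I understand, the solution would not lie in specifying ParserAnnotations.ConstraintAnnotation. Or would it?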

Answer 1

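You can specify that a certain span must be labeled a certain way. So you can say that the entire span has to be an 'S'. But I think you have to do this in Java code.

Here is example code that shows setting constraints.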

https://github.com/stanfordnlp/CoreNLP/blob/master/itest/src/edu/stanford/nlp/parser/shiftreduce/ShiftReduceParserITest.java


Tags: python, nlp, nltk, stanford-nlp