Pattern matching with Tregex in Stanza's CoreNLP implementation doesn't seem to find the right subtrees

Question

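I'm fairly new to NLP and am currently trying to extract different phrase structures from German texts. For this I'm using Stanza's Stanford CoreNLP implementation together with the Tregex feature for pattern matching in trees.

So far I haven't had any problems and was able to match simple patterns like "NPs" or "S > CS". Now I'm trying to match S nodes that are immediately dominated either by ROOT or by a CS node that is itself immediately dominated by ROOT. For this I'm using the pattern "S > (CS > TOP) | > TOP", but it doesn't seem to work properly. I'm using the following code: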

from stanza.server import CoreNLPClient

text = "Peter kommt und Paul geht."

def linguistic_units(_client, _text, _pattern):
    # Run the Tregex pattern against the parsed text
    matches = _client.tregex(_text, _pattern)
    sentences = matches['sentences']
    print('+++++Tree++++')
    print(sentences[0])
    count_units = 0  # number of matches found across all sentences
    for sentence in sentences:
        for match_id in sentence:
            print(sentence[match_id]['spanString'])
            count_units += 1
    return count_units


with CoreNLPClient(properties='./corenlp/StanfordCoreNLP-german.properties',
                   annotators=['tokenize', 'ssplit', 'pos', 'lemma', 'ner', 'parse', 'depparse', 'coref'],
                   timeout=300000,
                   be_quiet=True,
                   endpoint='http://localhost:9001',
                   memory='16G') as client:

    result = linguistic_units(client, text, 'S > (CS > ROOT) | > ROOT')
    print(result)

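For the example text "Peter kommt und Paul geht", the pattern I'm using should match the two phrases "Peter kommt" and "Paul geht", but it doesn't. I then looked at the tree itself; the parser output is as follows: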

constituency parse of first sentence
child {
  child {
    child {
      child {
        child {
          value: "Peter"
        }
        value: "PROPN"
      }
      child {
        child {
          value: "kommt"
        }
        value: "VERB"
      }
      value: "S"
    }
    child {
      child {
        value: "und"
      }
      value: "CCONJ"
    }
    child {
      child {
        child {
          value: "Paul"
        }
        value: "PROPN"
      }
      child {
        child {
          value: "geht"
        }
        value: "VERB"
      }
      value: "S"
    }
    value: "CS"
  }
  child {
    child {
      value: "."
    }
    value: "PUNCT"
  }
  value: "NUR"
}
value: "ROOT"
score: 5466.83349609375

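I now suspect this is caused by the ROOT node, since it is the last node of the tree. Shouldn't the ROOT node be at the beginning of the tree? Does anyone know what I'm doing wrong?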

Answer 1

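A few comments:

1.) Assuming you are using a recent version of CoreNLP (4.0.0+), you need to use the mwt annotator with German. So your annotators list should be tokenize,ssplit,mwt,pos,parse

2.) For clarity, here is your sentence in PTB format: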

(ROOT
  (NUR
    (CS
      (S (PROPN Peter) (VERB kommt))
      (CCONJ und)
      (S (PROPN Paul) (VERB geht)))))

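As you can see, ROOT is the root node of the tree, so your pattern would not match in this sentence: CS is immediately dominated by NUR, not by ROOT. I personally find the PTB format easier for seeing the tree structure and for writing Tregex patterns. You can get it via the json or text output formats (rather than the serialized object). In the client request, set output_format="text".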
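
The structural point in comment 2 can be checked without CoreNLP. The following is a minimal sketch in plain Python: the tree is hand-copied from the PTB rendering above into nested (label, children) tuples, and the helper names (to_ptb, words, s_under_cs_under) are made up for illustration. It only mimics the "S > (CS > X)" part of the pattern and is not how Tregex itself is implemented.

```python
# Hand-copied version of the PTB tree shown above, as (label, children) tuples.
# This is an illustrative stand-in, not the object that CoreNLP returns.
TREE = ("ROOT", [
    ("NUR", [
        ("CS", [
            ("S", [("PROPN", [("Peter", [])]), ("VERB", [("kommt", [])])]),
            ("CCONJ", [("und", [])]),
            ("S", [("PROPN", [("Paul", [])]), ("VERB", [("geht", [])])]),
        ]),
    ]),
])


def to_ptb(node):
    """Render a (label, children) tree in bracketed PTB style."""
    label, children = node
    if not children:  # leaf token
        return label
    return "(" + label + " " + " ".join(to_ptb(c) for c in children) + ")"


def words(node):
    """Collect the leaf tokens under a node, left to right."""
    label, children = node
    if not children:
        return [label]
    return [w for c in children for w in words(c)]


def s_under_cs_under(tree, grandparent_label):
    """Mimic 'S > (CS > X)': S whose parent is CS whose parent is labelled X."""
    found = []

    def walk(node, parent, grandparent):
        label, children = node
        if label == "S" and parent == "CS" and grandparent == grandparent_label:
            found.append(" ".join(words(node)))
        for child in children:
            walk(child, label, parent)

    walk(tree, None, None)
    return found


print(to_ptb(TREE))
print(s_under_cs_under(TREE, "ROOT"))  # CS is not a direct child of ROOT
print(s_under_cs_under(TREE, "NUR"))   # CS is a direct child of NUR
```

On this tree the ROOT query finds nothing, while the NUR query finds both clauses. By the same reasoning, a Tregex pattern such as 'S > (CS > NUR)' or 'S > (CS >> ROOT)' may be closer to what is wanted, but that suggestion is an assumption and has not been run against CoreNLP here.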

3.) Here is the latest documentation on using the Stanza client: https://stanfordnlp.github.io/stanza/client_properties.html
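
Putting comments 1 and 2 together, a client setup might look roughly like the sketch below. It reuses the properties path, timeout and memory values from the question; the function name show_parse is hypothetical, and the code assumes that stanza is installed and that a local CoreNLP distribution with the German models is available. It has not been run against a server here.

```python
# Annotator list from comment 1 (mwt added for German with CoreNLP 4.0.0+).
ANNOTATORS = ["tokenize", "ssplit", "mwt", "pos", "parse"]


def show_parse(text, properties="./corenlp/StanfordCoreNLP-german.properties"):
    """Print the text-format annotation, which includes the bracketed parse."""
    # Imported lazily so this module still loads where stanza is not installed.
    from stanza.server import CoreNLPClient

    with CoreNLPClient(
        properties=properties,  # properties file path taken from the question
        annotators=ANNOTATORS,
        timeout=300000,
        memory="16G",
        be_quiet=True,
    ) as client:
        # output_format="text" is the setting suggested in comment 2
        print(client.annotate(text, output_format="text"))
```

Calling show_parse("Peter kommt und Paul geht.") should then print the annotation in text form, which makes it easier to see which node sits directly under ROOT before writing a Tregex pattern.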

stanford-nlp