Is there a way to get entire constituents using SpaCy?

Question

I guess I'm trying to navigate SpaCy's parse tree in a more direct way than what is provided.

For example, given sentences like "He was a genius" or "The dog was green", I'd like to be able to save the objects ("a genius" and "green") to variables.

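token.children provides only the immediate syntactic dependents. In the first example, the children of "was" are "He" and "genius", and "a" is in turn a child of "genius". That isn't much help if I just want the entire constituent "a genius". I'm not sure how to reconstruct it from token.children, or whether there is a better way.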

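I can figure out how to match "is" and "was" using token.text (part of what I'm trying to do), but I don't know how to return the whole constituent "a genius" using the information provided about the children.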

import spacy
nlp = spacy.load('en_core_web_sm')

sent = nlp("He was a genius.")

for token in sent:
    print(token.text, token.tag_, token.dep_, [child for child in token.children])

Here is the output:

He PRP nsubj []
was VBD ROOT [He, genius, .]
a DT det []
genius NN attr [a]
. . punct []

Answer 1

You can use Token.subtree (see the docs) to get all dependents of a given node in the dependency tree.

For example, to get every noun or adjective phrase that is the complement of a form of "be":

import spacy

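# The 'en' shortcut only works in older spaCy releases; load the model by its full name.
nlp = spacy.load('en_core_web_sm')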

text = "He was a genius of the best kind and his dog was green."

for token in nlp(text):
    if token.pos_ in ['NOUN', 'ADJ']:
        if token.dep_ in ['attr', 'acomp'] and token.head.lemma_ == 'be':
            # to test for only verb forms 'is' and 'was' use token.head.lower_ in ['is', 'was']
            print([t.text for t in token.subtree])

Output:

['a', 'genius', 'of', 'the', 'best', 'kind']
['green']
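
Since the question also asks how the constituent could be rebuilt from token.children, here is a minimal sketch of that traversal in plain Python. The Node class and collect_subtree function are hypothetical stand-ins introduced only for illustration; they are not part of spaCy. The tree is written by hand to mirror the parse shown in the question, so the snippet runs without a model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a spaCy Token: text, position, children."""
    text: str
    i: int
    children: List["Node"] = field(default_factory=list)


def collect_subtree(token):
    """Return the token plus all of its descendants, sorted by position."""
    nodes = [token]
    for child in token.children:
        nodes.extend(collect_subtree(child))
    return sorted(nodes, key=lambda t: t.i)


# Hand-built tree for "He was a genius." (assumed to match the parse above).
he = Node("He", 0)
a = Node("a", 2)
genius = Node("genius", 3, [a])
period = Node(".", 4)
was = Node("was", 1, [he, genius, period])

# Join the subtree of "genius" back into a single string.
constituent = " ".join(t.text for t in collect_subtree(genius))
print(constituent)
```

The function relies only on the .children and .i attributes, which real spaCy tokens also expose, so in practice Token.subtree already does this work for you. If a Span is preferred over a list of tokens, slicing the Doc from token.left_edge.i to token.right_edge.i + 1 should cover the same range, provided the subtree is contiguous.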


