How to use spaCy's built-in lemmatiser in a spaCy pipeline?

I want to use lemmatisation, but I can't see directly in the documentation how to use spaCy's built-in lemmatiser in a pipeline.

The docs for the lemmatiser say:

Initialize a Lemmatizer. Typically, this happens under the hood within spaCy when a Language subclass and its Vocab is initialized.

Does this mean the built-in lemmatisation process is an unmentioned part of the pipeline?

It's mentioned in the docs under the pipeline subheading, whereas the docs for pipeline usage only mention "custom lemmatisation" and how to use it.

This is all rather contradictory information.

Does this mean the built-in lemmatisation process is an unmentioned part of the pipeline?

Simply put, yes. The Lemmatizer is loaded when the Language and its Vocab are loaded.
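You can see this for yourself: in spaCy v2 the lemmatiser never shows up as a named pipeline component, because it is attached to the Vocab rather than to nlp.pipeline. A minimal sketch, assuming spaCy v2 is installed (in spaCy v3 a "lemmatizer" component does appear explicitly in trained pipelines):

```python
import spacy

# Even a blank pipeline lists no named components at all, yet in spaCy v2
# lemmatisation still happens "under the hood" via the Vocab's lookup tables
# whenever token.lemma_ is accessed.
nlp = spacy.blank("en")
print(nlp.pipe_names)  # -> []
```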

Example usage:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("Apples and oranges are similar. Boots and hippos aren't.")

print('\n')
print("Token Attributes: \n", "token.text, token.pos_, token.tag_, token.dep_, token.lemma_")
for token in doc:
    # Print the text, part-of-speech tags, dependency label and lemma for each token
    print("{:<12}{:<12}{:<12}{:<12}{:<12}".format(token.text, token.pos_, token.tag_, token.dep_, token.lemma_))

Output:

Token Attributes: 
 token.text, token.pos_, token.tag_, token.dep_, token.lemma_
Apples      NOUN        NNS         nsubj       apple       
and         CCONJ       CC          cc          and         
oranges     NOUN        NNS         conj        orange      
are         AUX         VBP         ROOT        be          
similar     ADJ         JJ          acomp       similar     
.           PUNCT       .           punct       .           
Boots       NOUN        NNS         nsubj       boot        
and         CCONJ       CC          cc          and         
hippos      NOUN        NN          conj        hippos      
are         AUX         VBP         ROOT        be          
n't         PART        RB          neg         not         
.           PUNCT       .           punct       .      
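Note how "hippos" was left unlemmatised in the output above: the tagger predicted NN (singular), and the rule-based lemmatiser conditions its suffix rules on the POS tag, so no plural rule fired. A toy illustration of that idea in plain Python (illustrative only, not spaCy's actual rule tables):

```python
# Toy POS-conditioned suffix rules (hypothetical, not spaCy's real tables).
RULES = {
    "NNS": [("s", "")],  # plural noun: strip a trailing "s"
    "NN": [],            # singular noun: nothing to do
}

def toy_lemma(word, tag):
    # Apply the first matching suffix rule for this tag, if any.
    for suffix, repl in RULES.get(tag, []):
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + repl
    return word

print(toy_lemma("boots", "NNS"))  # -> boot
print(toy_lemma("hippos", "NN"))  # -> hippos (wrong tag, so no rule applies)
```

The same mechanism explains why a mis-tagged token in a real spaCy pipeline comes back with its surface form as the lemma.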

Also have a look at this thread, which has some interesting information on lemmatisation speed.