spaCy lemmatization inconsistent with the lemma_lookup table

Question

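There seems to be an inconsistency between the lemma a token gets when I iterate over a spaCy doc and the lemma I find when I look the word up in the Vocab's lemma_lookup table.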

import spacy

nlp = spacy.load("en_core_web_lg")
doc = nlp("I'm running faster")
for tok in doc:
    print(tok.lemma_)

This prints "faster" as the lemma for the token "faster" instead of "fast". However, the token does exist in the lemma_lookup table.

nlp.vocab.lookups.get_table("lemma_lookup")["faster"]

This outputs "fast".

Am I doing something wrong? Or is there another reason why the two differ? Maybe my definitions are off and I'm comparing apples with oranges?

I'm using the following versions on Ubuntu Linux: spacy==2.2.4 spacy-lookups-data==0.1.0

Answer 1

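With a model like en_core_web_lg, which includes a tagger and the rules for a rule-based lemmatizer, spaCy provides the rule-based lemmas rather than the lookup lemmas whenever POS tags are available for the rules to use. The lookup lemmas aren't great overall and are only used as a backup if the model/pipeline doesn't have enough information to provide the rule-based lemmas.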

For faster, the POS tag is ADV, and the rules leave adverbs as-is. If it had been tagged as ADJ, the lemma would be fast under the current rules.
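To make the fallback order concrete, here is a toy sketch of the decision described above. It is not spaCy's actual implementation: the LOOKUP and RULES tables and the lemmatize function are hypothetical, heavily simplified stand-ins that only cover this one example.

```python
# Toy model of "rules when a POS tag is available, lookup otherwise".
# NOT spaCy's code: LOOKUP and RULES are hypothetical, simplified tables.
LOOKUP = {"faster": "fast"}
RULES = {"ADJ": [("er", ""), ("est", "")]}  # (suffix, replacement) pairs


def lemmatize(word, pos=None):
    """Return a lemma: rule-based if a POS tag is known, lookup otherwise."""
    if pos is None:
        # No tagger output available: fall back to the lookup table.
        return LOOKUP.get(word, word)
    for suffix, replacement in RULES.get(pos, []):
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    # No rule for this POS tag: the word is left as-is.
    return word


print(lemmatize("faster", "ADV"))  # faster (no ADV rules in this toy table)
print(lemmatize("faster", "ADJ"))  # fast
print(lemmatize("faster"))         # fast (lookup fallback)
```

The point is only the order of precedence: once a POS tag exists, the lookup table is never consulted, which is why the two results in the question differ.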

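The lemmatizer tries to provide the best lemmas without requiring the user to manage any settings, but it's also not very configurable right now (v2.2). If you want to run the tagger but still get lookup lemmas, you'll have to replace the lemmas after running the tagger.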
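A minimal sketch of that replacement step might look like the following. The function name use_lookup_lemmas is made up for illustration; it assumes only that tokens expose a text attribute and a writable lemma_ attribute, and that the table supports dict-style get with string keys.

```python
def use_lookup_lemmas(doc, table):
    """Overwrite each token's lemma with the table entry, if one exists.

    Hypothetical helper: `doc` is any iterable of tokens with `.text` and a
    writable `.lemma_`; `table` is any mapping with a dict-style `.get`.
    """
    for token in doc:
        # Keep the existing (rule-based) lemma when the table has no entry.
        token.lemma_ = table.get(token.text, token.lemma_)
    return doc
```

With spaCy v2.x, you would presumably call this on the doc after the tagger has run, passing nlp.vocab.lookups.get_table("lemma_lookup") as the table, or wrap it in a custom pipeline component placed after the tagger. I haven't checked this against every v2 release, so treat it as a starting point.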

Answer 2

aab wrote:

The lookup lemmas aren't great overall and are only used as a backup if the model/pipeline doesn't have enough information to provide the rule-based lemmas.

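That is also how I understood it from the spaCy code, but since I wanted to add my own dictionary to improve the lemmatization of a pretrained model, I decided to try the following, which worked well: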

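import spacy

# Load the model
nlp = spacy.load('es_core_news_lg')
# Define the dictionary: key = lemma, value = list of tokens to map to that lemma.
# Lower-case and capitalized forms are both listed, so matching is effectively case-insensitive.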
corr_es = {
    "decir":["dixo", "decia", "Dixo", "Decia"],
    "ir":["iba", "Iba"],
    "parecer":["parecia", "Parecia"],
    "poder":["podia", "Podia"],
    "ser":["fuesse", "Fuesse"],
    "haber":["habia", "havia", "Habia", "Havia"],
    "ahora" : ["aora", "Aora"],
    "estar" : ["estàn", "Estàn"],
    "lujo" : ["luxo","luxar", "Luxo","Luxar"],
    "razón" : ["razon", "razòn", "Razon", "Razòn"],
    "caballero" : ["cavallero", "Cavallero"],
    "mujer" : ["muger", "mugeres", "Muger", "Mugeres"],
    "vez" : ["vèz", "Vèz"],
    "jamás" : ["jamas", "Jamas"],
    "demás" : ["demas", "demàs", "Demas", "Demàs"],
    "cuidar" : ["cuydado", "Cuydado"],
    "posible" : ["possible", "Possible"],
    "comedia":["comediar", "Comedias"],
    "poeta":["poetas", "Poetas"],
    "mano":["manir", "Manir"],
    "barba":["barbar", "Barbar"],
    "idea":["ideo", "Ideo"]
}
# Point each listed token at its lemma in the lookup table
lemma_lookup = nlp.vocab.lookups.get_table("lemma_lookup")
for lemma, variants in corr_es.items():
    for variant in variants:
        lemma_lookup[variant] = lemma
# Process the text (`text` is the string you want to analyze)
doc = nlp(text)
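Listing every capitalized form by hand gets tedious, so the variants could also be generated. The sketch below is a plain-Python helper (build_variant_table is a name I made up, not part of spaCy) that flattens a lemma-to-variants dictionary into a variant-to-lemma mapping and adds the capitalized form of each variant automatically.

```python
def build_variant_table(corrections):
    """Flatten {lemma: [variants]} into {variant: lemma}.

    Hypothetical helper: adds both the lower-case and the capitalized
    form of every variant, so only one spelling needs to be listed.
    """
    table = {}
    for lemma, variants in corrections.items():
        for variant in variants:
            table[variant.lower()] = lemma
            table[variant.capitalize()] = lemma
    return table


# Small excerpt of the dictionary above, lower-case forms only
flat = build_variant_table({"decir": ["dixo", "decia"], "ir": ["iba"]})
print(flat["Dixo"])  # decir
print(flat["iba"])   # ir
```

Each resulting pair can then be written into the lemma_lookup table exactly as in the loop above.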

Hope this helps.

Tags: python, nlp, lemmatization, spacy