Morphology software for English
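In my application, I need software that can: a) convert words to their basic forms and b) determine whether they are 'nouns', 'verbs', etc.
I found a list of software that can do this job.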
http://aclweb.org/aclwiki/index.php?title=Morphology_software_for_English
Does anyone have experience with any of these? Which one do you recommend?
You can use NLTK (Python) to perform these tasks.
Find if they are 'nouns', 'verbs'...
This task is called part-of-speech tagging. You can use the nltk.pos_tag
function. (See the Penn Treebank tagset.)
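To make the tags easier to read, here is a minimal sketch of a lookup table for a few Penn Treebank tags. The names PENN_TAG_GLOSSES and describe_tag are made up for illustration and are not part of NLTK; the table is a small hand-picked subset, not the full tagset.

```python
# Glosses for a handful of Penn Treebank tags (a hand-picked subset, not the full tagset).
PENN_TAG_GLOSSES = {
    "DT": "determiner",
    "IN": "preposition or subordinating conjunction",
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "RB": "adverb",
    "VB": "verb, base form",
    "VBG": "verb, gerund or present participle",
    "VBP": "verb, non-3rd person singular present",
}


def describe_tag(tag):
    """Return a readable gloss for a Penn tag, or a placeholder if it is not in the subset."""
    return PENN_TAG_GLOSSES.get(tag, "(not in this subset)")


if __name__ == "__main__":
    for tag in ("NNS", "VBP", "VBG", "XYZ"):
        print(tag, "->", describe_tag(tag))
```

Tags that share a two-letter prefix (NN/NNS, VB/VBG/VBP) belong to the same broad word class, which is handy when a coarser category is all you need.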
Convert the words to their basic forms
This task is called lemmatization. You can use the nltk.stem.wordnet.WordNetLemmatizer.lemmatize
function.
Example
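import nltk
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet as wn

# The tokenizer, tagger and WordNet data must be installed first (see nltk.download()).

# Map a Penn Treebank tag to a WordNet POS by its first two letters; default to noun.
def penn_to_wn(penn_tag):
    return {'NN': wn.NOUN, 'JJ': wn.ADJ, 'VB': wn.VERB, 'RB': wn.ADV}.get(penn_tag[:2], wn.NOUN)

sentence = "The rabbits are eating in the garden."
tokens = nltk.word_tokenize(sentence)   # split the sentence into word tokens
pos_tags = nltk.pos_tag(tokens)         # list of (token, Penn Treebank tag) pairs
wl = WordNetLemmatizer()
# Lemmatize each token using the WordNet POS derived from its Penn tag.
lemmas = [wl.lemmatize(token, pos=penn_to_wn(tag)) for token, tag in pos_tags]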
Then, if you print the results:
>>> tokens
['The', 'rabbits', 'are', 'eating', 'in', 'the', 'garden', '.']
>>> pos_tags
[('The', 'DT'),
('rabbits', 'NNS'),
('are', 'VBP'),
('eating', 'VBG'),
('in', 'IN'),
('the', 'DT'),
('garden', 'NN'),
('.', '.')]
>>> lemmas
['The', u'rabbit', u'be', u'eat', 'in', 'the', 'garden', '.']
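To see which WordNet part of speech each token is actually lemmatized with, here is a small sketch of the penn_to_wn mapping that runs without NLTK installed. It assumes the one-letter strings 'n', 'v', 'a', 'r' as stand-ins for wn.NOUN, wn.VERB, wn.ADJ and wn.ADV, and reuses the pos_tags output shown above as hard-coded input.

```python
# Stand-ins for nltk.corpus.wordnet's NOUN, VERB, ADJ and ADV constants (assumed one-letter strings).
NOUN, VERB, ADJ, ADV = "n", "v", "a", "r"


def penn_to_wn(penn_tag):
    """Map a Penn Treebank tag to a WordNet POS via its first two letters; default to noun."""
    return {"NN": NOUN, "JJ": ADJ, "VB": VERB, "RB": ADV}.get(penn_tag[:2], NOUN)


# Tagger output copied from the example above, so no NLTK call is needed here.
pos_tags = [("The", "DT"), ("rabbits", "NNS"), ("are", "VBP"), ("eating", "VBG"),
            ("in", "IN"), ("the", "DT"), ("garden", "NN"), (".", ".")]

wn_tags = [(token, penn_to_wn(tag)) for token, tag in pos_tags]

if __name__ == "__main__":
    print(wn_tags)
```

Tags outside the four mapped classes (DT, IN, punctuation) fall back to noun. In the output above those tokens come back unchanged, including the capital letter in 'The', so lowercase the tokens yourself if you need case-insensitive lemmas.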