Unigram tagging in NLTK
Using the NLTK UnigramTagger, I am training on sentences from the Brown Corpus. I have tried different categories, but I get roughly the same value each time. For every category, such as fiction, romance, or humor, the value is around 0.9328...
import nltk
from nltk.corpus import brown
# Fiction
brown_tagged_sents = brown.tagged_sents(categories='fiction')
brown_sents = brown.sents(categories='fiction')
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
unigram_tagger.evaluate(brown_tagged_sents)
>>> 0.9415956079897209
# Romance
brown_tagged_sents = brown.tagged_sents(categories='romance')
brown_sents = brown.sents(categories='romance')
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
unigram_tagger.evaluate(brown_tagged_sents)
>>> 0.9348490474422324
Why is that? Is it because they come from the same corpus? Or because their part-of-speech tags are the same?
It looks like you are training and then evaluating the trained UnigramTagger on the same training data. See the documentation for nltk.tag, specifically the part about evaluation. With your code you will naturally get a high score, because your training data and your evaluation/testing data are the same. If you change the test data so that it differs from the training data, you will get different results. My examples follow:
Category: Fiction
Here I use brown.tagged_sents(categories='fiction')[:500] as the training set and brown.tagged_sents(categories='fiction')[501:600] as the test/evaluation set:
from nltk.corpus import brown
import nltk
# Fiction
brown_tagged_sents = brown.tagged_sents(categories='fiction')[:500]
brown_sents = brown.sents(categories='fiction') # not sure what this line is doing here
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
unigram_tagger.evaluate(brown.tagged_sents(categories='fiction')[501:600])
This gives you a score of ~0.7474610697359513.
Category: Romance
Here I use brown.tagged_sents(categories='romance')[:500] as the training set and brown.tagged_sents(categories='romance')[501:600] as the test/evaluation set:
from nltk.corpus import brown
import nltk
# Romance
brown_tagged_sents = brown.tagged_sents(categories='romance')[:500]
brown_sents = brown.sents(categories='romance') # not sure what this line is doing here
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
unigram_tagger.evaluate(brown.tagged_sents(categories='romance')[501:600])
This gives you a score of ~0.7046799354491662.
I hope this helps and answers your question.
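To see why evaluating on the training data inflates the score, here is a minimal from-scratch sketch of the same idea (this is not NLTK's implementation, just a toy unigram tagger on made-up sentences): each word is tagged with its most frequent tag from training, so every training token is either tagged correctly or at worst with its majority tag, while held-out data can contain unseen words.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    # Count (word, tag) pairs and keep the most frequent tag per word.
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def evaluate(model, tagged_sents):
    # Token-level accuracy; unknown words get no tag and count as wrong.
    correct = total = 0
    for sent in tagged_sents:
        for word, tag in sent:
            total += 1
            if model.get(word) == tag:
                correct += 1
    return correct / total

train = [[("the", "DT"), ("cat", "NN"), ("sat", "VB")],
         [("the", "DT"), ("dog", "NN"), ("ran", "VB")]]
test = [[("the", "DT"), ("fox", "NN"), ("ran", "VB")]]

model = train_unigram(train)
print(evaluate(model, train))  # 1.0 -- evaluated on its own training data
print(evaluate(model, test))   # lower: "fox" was never seen in training
```

The same effect, on a smaller scale, is what drops the Brown Corpus scores from ~0.93 to ~0.70–0.75 once the evaluation set is held out.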