Sentiment analysis for Dutch tweets using NLTK Corpus conll2002
I need to run sentiment analysis on a list of Dutch tweets, and I am using conll2002 for it. Here is the code I am using:
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import conll2002
import time

t=time.time()

def word_feats(words):
    return dict([(word, True) for word in words])

#negids = conll2002.fileids('neg')
def train():
    #negids = conll2002.fileids('neg')
    #posids = conll2002.fileids('pos')
    negids = conll2002.fileids()
    posids = conll2002.fileids()
    negfeats = [(word_feats(conll2002.words(fileids=[f])), 'neg') for f in negids]
    posfeats = [(word_feats(conll2002.words(fileids=[f])), 'pos') for f in posids]
    negcutoff = len(negfeats)*3/4
    poscutoff = len(posfeats)*3/4
    trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
    testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
    print 'train on %d instances, test on %d instances' % (len(trainfeats), len(testfeats))
    classifier = NaiveBayesClassifier.train(trainfeats)
    print 'accuracy:', nltk.classify.util.accuracy(classifier, testfeats)
    classifier.show_most_informative_features()

x=train()
print x
print time.time()-t
The code above runs, but the output is as follows:
train on 8 instances, test on 4 instances
accuracy: 0.5
Most Informative Features
poderlas = True pos : neg = 1.0 : 1.0
voert = True pos : neg = 1.0 : 1.0
contundencia = True pos : neg = 1.0 : 1.0
encuestocracia = None pos : neg = 1.0 : 1.0
alivien = None pos : neg = 1.0 : 1.0
Bogotá = True pos : neg = 1.0 : 1.0
Especialidades = True pos : neg = 1.0 : 1.0
hoofdredacteurs = True pos : neg = 1.0 : 1.0
quisieron = True pos : neg = 1.0 : 1.0
asciendan = None pos : neg = 1.0 : 1.0
None
9.21083234
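The pos:neg ratio is 1:1 in every case. How can I fix this? I think the problem may lie in the following statements, which are currently commented out in my code: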
negids = conll2002.fileids('neg')
posids = conll2002.fileids('pos')
If I uncomment these two statements, I get this error:
Traceback (most recent call last):
  File "naive1.py", line 31, in <module>
    x=train()
  File "naive1.py", line 13, in train
    negids = conll2002.fileids('neg')
TypeError: fileids() takes exactly 1 argument (2 given)
I tried using self to work around this, but it still does not work. Can someone point me in the right direction? Thanks in advance.
The fileids() method accepts a categories argument, but only in categorized corpora. For example:
>>> from nltk.corpus import brown
>>> brown.fileids("mystery")
['cl01', 'cl02', 'cl03', 'cl04', 'cl05', 'cl06', 'cl07', 'cl08', 'cl09',
'cl10', 'cl11', 'cl12', 'cl13', 'cl14', 'cl15', 'cl16', 'cl17', 'cl18',
'cl19', 'cl20', 'cl21', 'cl22', 'cl23', 'cl24']
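Your call fails because the CONLL corpora have no categories. That is because they are not annotated for sentiment: CONLL 2000 and CONLL 2002 are both chunked corpora (for NP/PP chunks and named entities, respectively).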
>>> conll2002.categories()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'ConllChunkCorpusReader' object has no attribute 'categories'
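If you want to check this before calling fileids() with a category, one option is to test the reader class for a categories method. The sketch below is written for Python 3 and assumes only that NLTK is installed; it inspects the reader classes themselves, so no corpus data has to be downloaded. The helper name supports_categories is made up for illustration, and the statement that brown is loaded with CategorizedTaggedCorpusReader is my understanding of NLTK's corpus setup rather than something shown above.

```python
from nltk.corpus.reader import (
    CategorizedTaggedCorpusReader,  # reader class assumed to back `brown`
    ConllChunkCorpusReader,         # reader class named in the conll2002 traceback
)


def supports_categories(reader_class):
    """Return True if the reader class defines a categories() method.

    Hypothetical helper, not part of NLTK.
    """
    return hasattr(reader_class, "categories")


for cls in (CategorizedTaggedCorpusReader, ConllChunkCorpusReader):
    print(cls.__name__, supports_categories(cls))
```

The same hasattr test should work on a loaded corpus object such as conll2002, provided the corpus data is installed.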
So the short answer to your question is that you cannot train a sentiment analyzer on the conll2002 corpus.
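As for the 1:1 ratios: with the category arguments commented out, negids and posids both come from the same conll2002.fileids() call, so every file is labeled once as 'neg' and once as 'pos'. Each word then occurs equally often under both labels, and no feature can separate them. The sketch below reproduces that effect in plain Python 3, without NLTK. The two short word lists are hypothetical stand-ins for corpus files, and the counting is a simplification of what the classifier does, not NLTK's actual implementation.

```python
from collections import Counter


def word_feats(words):
    """Bag-of-words features, as in the question."""
    return {word: True for word in words}


# Hypothetical stand-ins for the corpus files; any documents would do.
documents = [
    ["de", "hoofdredacteurs", "voert"],
    ["Bogotá", "quisieron", "contundencia"],
]

# The question builds both lists from the same file ids, so every
# document appears twice: once labeled 'neg' and once labeled 'pos'.
negfeats = [(word_feats(doc), "neg") for doc in documents]
posfeats = [(word_feats(doc), "pos") for doc in documents]


def feature_counts(labeled_feats):
    """Count in how many documents each word feature occurs."""
    counts = Counter()
    for feats, _label in labeled_feats:
        counts.update(feats.keys())
    return counts


neg_counts = feature_counts(negfeats)
pos_counts = feature_counts(posfeats)

# Ratio of 'pos' occurrences to 'neg' occurrences for each word.
ratios = {word: pos_counts[word] / neg_counts[word] for word in pos_counts}
print(ratios)
```

Every ratio comes out as 1.0 because the two labeled sets are identical. To get informative features you need documents that really differ by label, which means a corpus annotated for sentiment.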