New to nltk, having trouble with conditional frequency
I am new to Python and nltk (I started 2 hours ago). Here is what I have been asked to do:
Write a function GetAmbigousWords(corpus, N) that finds words in the
corpus with more than N observed tags. This function should return a
ConditionalFreqDist object where the conditions are the words and the
frequency distribution indicates the tag frequencies for each word.
Here is what I have done so far:
def GetAmbiguousWords(corpus, number):
    conditional_frequency = ConditionalFreqDist()
    word_tag_dict = defaultdict(set)  # Creates a dictionary of sets
    for (word, tag) in corpus:
        word_tag_dict[word].add(tag)
    for taggedWord in word_tag_dict:
        if ( len(word_tag_dict[taggedWord]) >= number ):
            condition = taggedWord
            conditional_frequency[condition]  # do something, I don't know what to do
    return conditional_frequency
For example, this is how the function would be called:
GetAmbiguousWords(nltk.corpus.brown.tagged_words(categories='news'), 4)
I would like to know whether I am on the right track or completely off. In particular, I do not really understand conditional frequency.
Thanks in advance.
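With a frequency distribution, you can count how often each word occurs in a text:
from nltk import FreqDist, word_tokenize

text = "cow cat mouse cat tiger"
fDist = FreqDist(word_tokenize(text))
for word in fDist:
    # freq() is the relative frequency: count of the word / total tokens
    print("Frequency of", word, fDist.freq(word))
This results in (the order of the lines may vary with the Python version):
Frequency of cow 0.2
Frequency of cat 0.4
Frequency of mouse 0.2
Frequency of tiger 0.2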
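If it helps to see what freq() reports: it is the count of a sample divided by the total number of samples. Here is a minimal stdlib-only sketch of that idea. It uses collections.Counter and a plain split() as stand-ins, so it is not how NLTK itself is implemented, and the names counts, total and relative are just for illustration.

```python
from collections import Counter

text = "cow cat mouse cat tiger"
# Assumption: a plain whitespace split stands in for word_tokenize here.
counts = Counter(text.split())
total = sum(counts.values())

# Relative frequency = occurrences of the word / total number of tokens.
relative = {word: count / total for word, count in counts.items()}
print(relative["cat"])
```

The relative frequencies of all words add up to 1, which is a handy sanity check.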
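Conditional frequency is basically the same, but you add a condition by which the frequencies are grouped, for example the word length:
from nltk import ConditionalFreqDist, word_tokenize

text = "cow cat mouse cat tiger"
cfdist = ConditionalFreqDist()
for word in word_tokenize(text):
    condition = len(word)  # group the counts by word length
    cfdist[condition][word] += 1
for condition in cfdist:
    # cfdist[condition] is an ordinary FreqDist for that condition
    for word in cfdist[condition]:
        print("Cond. frequency of", word, cfdist[condition].freq(word), "[condition is word length =", condition, "]")
This prints (the order of the lines may vary with the Python version):
Cond. frequency of cow 0.3333333333333333 [condition is word length = 3 ]
Cond. frequency of cat 0.6666666666666666 [condition is word length = 3 ]
Cond. frequency of mouse 0.5 [condition is word length = 5 ]
Cond. frequency of tiger 0.5 [condition is word length = 5 ]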
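Applied to the question, the condition would be the word and the samples would be its tags. The following is only a sketch of one way to do it, not necessarily what the assignment expects. It assumes the corpus is an iterable of (word, tag) pairs and reads "more than N" as a strict comparison (the code in the question uses >=). The sample list is made-up data for illustration, not taken from the Brown corpus.

```python
from nltk import ConditionalFreqDist


def GetAmbiguousWords(corpus, number):
    """Return a ConditionalFreqDist of the words with more than `number` distinct tags."""
    # Count every (word, tag) pair: conditions are words, samples are tags.
    all_words = ConditionalFreqDist(corpus)
    ambiguous = ConditionalFreqDist()
    for word in all_words.conditions():
        tag_counts = all_words[word]
        # len() of a FreqDist is the number of distinct samples (here: tags).
        if len(tag_counts) > number:
            for tag, count in tag_counts.items():
                ambiguous[word][tag] += count
    return ambiguous


# Tiny hand-made sample (hypothetical data, not from a real corpus).
sample = [
    ("run", "VB"), ("run", "NN"), ("run", "VB"),
    ("the", "AT"), ("the", "AT"),
    ("light", "NN"), ("light", "JJ"), ("light", "VB"),
]
result = GetAmbiguousWords(sample, 1)
for word in result.conditions():
    print(word, dict(result[word]))
```

With the Brown corpus you would pass nltk.corpus.brown.tagged_words(categories='news') instead of the sample list, which requires the corpus to be downloaded first.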
Hope this helps.