Python - Trigram Probability Distribution Smoothing Technique (Kneser Ney) in NLTK Returns Zero

I have a frequency distribution of my trigrams, and I train Kneser-Ney on it. When I check kneser_ney.prob for a trigram that is not in list_of_trigrams, I get zero! What am I doing wrong?

freq_dist = nltk.FreqDist(list_of_trigrams)
kneser_ney = nltk.KneserNeyProbDist(freq_dist)

The list even contains its (n-1)-gram. This is the trigram whose probability I want:

print(kneser_ney.prob(('ئامادەكاری', 'بۆ', 'تاقیكردنەوە')))

And this is what I have in the list:

('ئامادەكاری', 'بۆ', 'كارە')

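I searched online for anyone with the same problem, but I could not find anything...

I think that what you are observing is perfectly normal.

From the Wikipedia page on Kneser-Ney smoothing (Method section):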

Please note that p_KN is a proper distribution, as the values defined in above way are non-negative and sum to one.

The probability is 0 when the ngram did not occur in the corpus.

Quoting:

This is the whole point of smoothing, to reallocate some probability mass from the ngrams appearing in the corpus to those that don't so that you don't end up with a bunch of 0 probability ngrams.

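The sentence above does not mean that Kneser-Ney smoothing gives a non-zero probability to any ngram you pick. It means that, given a corpus, it assigns probabilities to the existing ngrams in such a way that some spare probability is left over for other ngrams in later analyses. That spare probability is something you have to assign to non-occurring ngrams yourself; it is not inherent to Kneser-Ney smoothing.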
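To make the idea of spare probability concrete, here is a minimal pure-Python sketch. It is not NLTK's implementation: it applies plain absolute discounting to the counts of one context and then hands the leftover mass to unseen words uniformly. The counts, the vocabulary, the 0.75 discount, and the helper names (`discounted_probs`, `spare_mass`, `prob_with_fallback`) are all assumptions made up for the illustration, and the uniform split is only a stand-in for a lower-order distribution.

```python
from collections import Counter


def discounted_probs(counts, discount=0.75):
    """Absolute discounting: subtract `discount` from every seen count."""
    total = sum(counts.values())
    return {word: (c - discount) / total for word, c in counts.items()}


def spare_mass(counts, discount=0.75):
    """Probability mass left unassigned after discounting the seen words."""
    return discount * len(counts) / sum(counts.values())


def prob_with_fallback(word, counts, vocab, discount=0.75):
    """Seen words keep their discounted probability; unseen words in
    `vocab` share the spare mass uniformly (an illustrative choice)."""
    seen = discounted_probs(counts, discount)
    if word in seen:
        return seen[word]
    unseen = [w for w in vocab if w not in counts]
    if word not in unseen:
        return 0.0  # word is outside the vocabulary
    return spare_mass(counts, discount) / len(unseen)


# Hypothetical counts of the words that follow one two-word context.
counts = Counter({"that": 3, "it": 2, "myself": 1})
vocab = {"that", "it", "myself", "nothing", "everything"}

seen_total = sum(discounted_probs(counts).values())
print(seen_total)          # below 1: the seen words do not use all the mass
print(spare_mass(counts))  # the part that is still unassigned
print(prob_with_fallback("nothing", counts, vocab))
print(sum(prob_with_fallback(w, counts, vocab) for w in vocab))
```

Without the fallback step, an unseen word simply gets 0, which matches what the question reports.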


Edit

For completeness, here is the code to observe this behavior (largely borrowed, and adapted to Python 3):

import nltk
nltk.download('gutenberg')
nltk.download('punkt')
from nltk.util import ngrams
from nltk.corpus import gutenberg

gut_ngrams = tuple(
    ngram for sent in gutenberg.sents()
    for ngram in ngrams(
        sent, 3, pad_left=True, pad_right=True,
        right_pad_symbol='EOS', left_pad_symbol="BOS"))
freq_dist = nltk.FreqDist(gut_ngrams)
kneser_ney = nltk.KneserNeyProbDist(freq_dist)

prob_sum = 0
for i in kneser_ney.samples():
    if i[0] == "I" and i[1] == "confess":
        prob_sum += kneser_ney.prob(i)
        print("{0}:{1}".format(i, kneser_ney.prob(i)))
print(prob_sum)
# ('I', 'confess', ','):0.26973684210526316
# ('I', 'confess', 'that'):0.16447368421052633
# ('I', 'confess', '.--'):0.006578947368421052
# ('I', 'confess', 'it'):0.03289473684210526
# ('I', 'confess', 'I'):0.16447368421052633
# ('I', 'confess', ',"'):0.03289473684210526
# ('I', 'confess', ';'):0.006578947368421052
# ('I', 'confess', 'myself'):0.006578947368421052
# ('I', 'confess', 'is'):0.006578947368421052
# ('I', 'confess', 'also'):0.006578947368421052
# ('I', 'confess', 'unto'):0.006578947368421052
# ('I', 'confess', '"--'):0.006578947368421052
# ('I', 'confess', 'what'):0.006578947368421052
# ('I', 'confess', 'there'):0.006578947368421052
# 0.7236842105263156

# trigram not appearing in corpus
print(kneser_ney.prob(('I', 'confess', 'nothing')))
# 0.0
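
The gap between the printed sum and 1 is the spare probability for the ('I', 'confess') context. The short check below recomputes it by hand. It assumes a discount of 0.75 (the default value of the discount argument of KneserNeyProbDist) and a context count of 38; the individual counts are inferred from the printed probabilities (for example 0.25 / 38 = 0.006578...), not read from NLTK.

```python
import math

discount = 0.75

# Counts inferred from the printed output via p = (count - discount) / 38:
# one continuation seen 11 times, two seen 7 times, two seen twice,
# and nine seen once.
counts = [11, 7, 7, 2, 2] + [1] * 9

total = sum(counts)                              # count of the context
probs = [(c - discount) / total for c in counts]
seen = sum(probs)                                # mass given to seen trigrams
spare = discount * len(counts) / total           # mass left unassigned

print(total)
print(seen)
print(spare)
print(math.isclose(seen + spare, 1.0))
```

Under these assumptions the seen trigrams account for 27.5/38 of the mass, and the remaining 10.5/38 is what stays available for trigrams such as ('I', 'confess', 'nothing').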