Python

Question

I have the frequency distribution of my trigrams, and I train Kneser-Ney on it. When I check kneser_ney.prob for a trigram that is not in list_of_trigrams, I get zero! What am I doing wrong?

freq_dist = nltk.FreqDist(list_of_trigrams)
kneser_ney = nltk.KneserNeyProbDist(freq_dist)

The list even contains the n-1-gram (the leading bigram) of the trigram I am querying. This is what I want:

print(kneser_ney.prob(('ئامادەكاری', 'بۆ', 'تاقیكردنەوە')))

This is what I have in the list:

('ئامادەكاری', 'بۆ', 'كارە')

I have searched the web for anyone with the same problem, but I found none...

Answer 1

I think what you are observing is perfectly normal.

From the Wikipedia page on Kneser-Ney smoothing (Method section):

Please note that p_KN is a proper distribution, as the values defined in above way are non-negative and sum to one.

The probability is 0 when the ngram did not occur in the corpus.

Quoting from:

This is the whole point of smoothing, to reallocate some probability mass from the ngrams appearing in the corpus to those that don't so that you don't end up with a bunch of 0 probability ngrams.

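The sentence above does not mean that Kneser-Ney smoothing gives you a non-zero probability for any ngram you pick. It means that, given a corpus, it assigns probabilities to the existing ngrams in such a way that some spare probability mass is left over for other ngrams in later analyses. Assigning this spare probability to non-occurring ngrams is something you have to do yourself; it is not inherent to Kneser-Ney smoothing.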
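To make the "spare probability" idea concrete, here is a minimal pure-Python sketch. It is not NLTK's implementation: the function names, the toy corpus, and the choice to spread the reserved mass uniformly over unseen words are all assumptions made for illustration (Kneser-Ney proper uses continuation counts rather than a uniform split). The discount of 0.75 matches the default of nltk.KneserNeyProbDist.

```python
from collections import Counter

DISCOUNT = 0.75  # assumed discount; same value as NLTK's default


def discounted_probs(trigrams, context, discount=DISCOUNT):
    """Return ({word: prob} for seen continuations of `context`, reserved mass)."""
    counts = Counter(t[2] for t in trigrams if tuple(t[:2]) == context)
    total = sum(counts.values())
    if total == 0:
        return {}, 0.0  # context never seen: nothing to discount
    # Each seen trigram gives up `discount` from its count.
    seen = {w: (c - discount) / total for w, c in counts.items()}
    # The mass given up is what remains available for unseen continuations.
    reserved = discount * len(counts) / total
    return seen, reserved


def prob_with_uniform_fallback(trigrams, vocab, trigram, discount=DISCOUNT):
    """Hypothetical helper: share the reserved mass equally among unseen words."""
    context, word = tuple(trigram[:2]), trigram[2]
    seen, reserved = discounted_probs(trigrams, context, discount)
    if word in seen:
        return seen[word]
    unseen = [w for w in vocab if w not in seen]
    if not seen or not unseen:
        return 0.0  # unseen context, or no unseen words left to share with
    return reserved / len(unseen)


# Toy data, invented for this example.
toy = [("a", "b", "c"), ("a", "b", "c"), ("a", "b", "d"), ("x", "a", "b")]
vocab = {"a", "b", "c", "d", "x"}

for w in sorted(vocab):
    print(w, prob_with_uniform_fallback(toy, vocab, ("a", "b", w)))
```

With this fallback the probabilities for the context ("a", "b") add up to one across the whole vocabulary, and the unseen continuation ("a", "b", "x") gets a non-zero value. A trigram whose context was never seen still gets 0.0 in this sketch.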

EDIT

For completeness, here is code to observe the behavior (largely taken from another source and adapted to Python 3):

import nltk
nltk.download('gutenberg')
nltk.download('punkt')

from nltk.util import ngrams
from nltk.corpus import gutenberg

gut_ngrams = tuple(
    ngram for sent in gutenberg.sents()
    for ngram in ngrams(
        sent, 3, pad_left=True, pad_right=True,
        right_pad_symbol='EOS', left_pad_symbol="BOS"))
freq_dist = nltk.FreqDist(gut_ngrams)
kneser_ney = nltk.KneserNeyProbDist(freq_dist)

prob_sum = 0
for i in kneser_ney.samples():
    if i[0] == "I" and i[1] == "confess":
        prob_sum += kneser_ney.prob(i)
        print("{0}:{1}".format(i, kneser_ney.prob(i)))
print(prob_sum)
# ('I', 'confess', ','):0.26973684210526316
# ('I', 'confess', 'that'):0.16447368421052633
# ('I', 'confess', '.--'):0.006578947368421052
# ('I', 'confess', 'it'):0.03289473684210526
# ('I', 'confess', 'I'):0.16447368421052633
# ('I', 'confess', ',"'):0.03289473684210526
# ('I', 'confess', ';'):0.006578947368421052
# ('I', 'confess', 'myself'):0.006578947368421052
# ('I', 'confess', 'is'):0.006578947368421052
# ('I', 'confess', 'also'):0.006578947368421052
# ('I', 'confess', 'unto'):0.006578947368421052
# ('I', 'confess', '"--'):0.006578947368421052
# ('I', 'confess', 'what'):0.006578947368421052
# ('I', 'confess', 'there'):0.006578947368421052
# 0.7236842105263156

# trigram not appearing in corpus
print(kneser_ney.prob(('I', 'confess', 'nothing')))
# 0.0
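
The printed sum of about 0.72 can be checked by hand. The sketch below assumes the default discount of 0.75 and uses trigram counts inferred from the printed probabilities (for example, 0.006578947... equals 0.25/38, so a single occurrence out of 38). The counts are a reconstruction, not values read from the corpus.

```python
# Counts of each word following ("I", "confess"), inferred from the output above.
counts = {
    ",": 11, "that": 7, ".--": 1, "it": 2, "I": 7, ',"': 2, ";": 1,
    "myself": 1, "is": 1, "also": 1, "unto": 1, '"--': 1, "what": 1, "there": 1,
}
discount = 0.75  # assumed: NLTK's default for KneserNeyProbDist

total = sum(counts.values())  # occurrences of the context ("I", "confess")

# Mass kept by the trigrams that were seen in the corpus.
seen_mass = sum((c - discount) / total for c in counts.values())

# Mass set aside by discounting: one `discount` per distinct continuation.
reserved_mass = discount * len(counts) / total

print(total)          # 38
print(seen_mass)      # about 0.7237, matching the printed sum
print(reserved_mass)  # about 0.2763, the spare probability
```

The two masses add up to one, so the missing 0.28 is the probability set aside for continuations that were not seen after "I confess".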

Python - Trigram Probability Distribution Smoothing Technique (Kneser Ney) in NLTK Returns Zero

probability

nltk

smoothing

python-3.x
