Why do you need a threshold when tokenizing a text corpus?

Question

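I'm teaching myself NLP and came across this Kaggle notebook, which uses an LSTM for text summarization. At the point where it turns the orderedDict of words into integers, there is some code that computes the percentage of rare words in the vocabulary: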

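thresh = 4  # words seen fewer than `thresh` times count as rare

# cnt / freq: number of rare words and their total occurrences
# tot_cnt / tot_freq: the same for the whole vocabulary
cnt, tot_cnt, freq, tot_freq = 0, 0, 0, 0

# word_counts maps each word to how often it occurs in the corpus
for key, value in x_tokenizer.word_counts.items():
    tot_cnt += 1
    tot_freq += value
    if value < thresh:
        cnt += 1
        freq += value

print("% of rare words in vocabulary:", (cnt / tot_cnt) * 100)
print("Total Coverage of rare words:", (freq / tot_freq) * 100)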

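Why is the threshold 4 there? As far as I know, the mapping from words to integers is arbitrary (unless each integer = the number of times the word is repeated), so a threshold of 4 seems a bit random to me.

Thanks in advance for your help :)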

Answer 1

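The threshold gives you a chance to ignore "rare" words that contribute little to bag-of-words processing. Likewise, you may want an upper threshold so that you can ignore words such as "the", "a", etc., because they are ubiquitous and don't help much in distinguishing sentence classes either.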
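
To make this concrete, here is a minimal sketch of a lower and an upper frequency threshold in plain Python. It assumes a made-up toy corpus and a hypothetical helper `filter_vocab`; neither comes from the notebook. Note that the threshold is compared against each word's count (what `x_tokenizer.word_counts` stores), not against the integer the word is later mapped to, so the value 4 is a tunable cutoff on frequency rather than something tied to the word-to-integer mapping.

```python
from collections import Counter

# Toy corpus; the sentences are made up for illustration.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "a cat and a dog met the bird",
]

# Count how often each word occurs across the whole corpus.
# This plays the role of x_tokenizer.word_counts in the question.
word_counts = Counter(word for sentence in corpus for word in sentence.split())


def filter_vocab(counts, min_count, max_count):
    """Hypothetical helper: keep words whose count lies in [min_count, max_count]."""
    return {w for w, c in counts.items() if min_count <= c <= max_count}


# Lower threshold drops rare words, upper threshold drops ubiquitous ones.
vocab = filter_vocab(word_counts, min_count=2, max_count=4)

# Same statistics as in the question: share of the vocabulary that is rare,
# and share of all tokens in the corpus that those rare words account for.
rare = {w: c for w, c in word_counts.items() if c < 2}
rare_vocab_pct = 100 * len(rare) / len(word_counts)
rare_token_pct = 100 * sum(rare.values()) / sum(word_counts.values())

print("Kept vocabulary:", sorted(vocab))
print("% of rare words in vocabulary:", rare_vocab_pct)
print("Total coverage of rare words:", rare_token_pct)
```

In this toy corpus the rare words make up a large share of the vocabulary but a much smaller share of the tokens, which is typically why dropping them costs little. The exact cutoff is a judgment call you tune for your data. In Keras, one common way to apply such a cutoff is the `num_words` argument of `Tokenizer`, which restricts the vocabulary to the most frequent words.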

python

nlp

keras