Occurrence to score

Question

I have a table of word frequencies, and I want to convert number_of_occurrence into a score between 0 and 10.

word     number_of_occurrence      score
and      200                       10
png      2                         1
where    50                        6 
news     120                       7

Answer 1

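The score lies between 0 and 10. The maximum score of 10 is reached at 50 occurrences, so any count above that should also score 10. At the other end, the minimum score is 0 and 5 occurrences score 1, so I assume any count below 5 scores 0.

The interpolation is based only on the condition you gave: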

If a word appears 50 times it should be closer to 10, and if a word appears 5 times it should be closer to 1.

df['score'] = df['number_of_occurrence'].apply(lambda x: x/5 if 5<=x<=50 else (0 if x< 5 else 10))
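
The same rule can also be written as a plain function, which may be easier to read and test than the inline lambda. This is only a minimal sketch: the name `occurrence_to_score` and the `low`/`high` parameters are illustrative assumptions, not part of the answer above. With the defaults it reproduces the 5 → 1 and 50 → 10 mapping.

```python
def occurrence_to_score(count, low=5, high=50):
    """Map an occurrence count to a 0-10 score (hypothetical helper).

    Counts below `low` score 0, counts above `high` score 10, and counts
    in between are interpolated linearly from 1 (at `low`) to 10 (at `high`).
    """
    if count < low:
        return 0.0
    if count > high:
        return 10.0
    # Straight line through the points (low, 1) and (high, 10).
    return 1 + 9 * (count - low) / (high - low)


# Counts taken from the table in the question.
counts = {"and": 200, "png": 2, "where": 50, "news": 120}
scores = {word: occurrence_to_score(n) for word, n in counts.items()}
```

With a DataFrame, the function could be applied as `df['number_of_occurrence'].apply(occurrence_to_score)`.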

Output (for the counts in the question): and 10, png 0, where 10, news 10.

Answer 2

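If you want to score term frequencies in a corpus, I suggest reading this Wikipedia article: Term frequency–inverse document frequency.

There are many ways to compute term frequency.
I understand that you want to score it from 0 to 10.
I don't understand how you computed the score values in your example.
In any case, I suggest a commonly used method: the log function.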

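On its own, $\log (1 + f_{t,d})$ grows without bound, so divide it by its value at the largest count to keep the result between 0 and 1:

$0 < \frac{\log (1 + f_{t,d})}{\log (1 + \max_{t' \in d} f_{t',d})} \le 1$

import math

# count the occurrences of your terms
# tokenize, sentence and stopWords are placeholders you must supply
# (for example, nltk.word_tokenize could serve as tokenize)
freq_table = {}
words = tokenize(sentence)
for word in words:
    word = word.lower()
    # stem the word if you can, using nltk
    if word in stopWords:  # do you really want to count the occurrences of 'and'?
        continue
    freq_table[word] = freq_table.get(word, 0) + 1

# log normalize the occurrences to a 0-10 score
# (reassigning the loop variable would not change the table, so build a new dict)
max_log = math.log(1 + max(freq_table.values()))
scores = {word: 10 * math.log(1 + count) / max_log for word, count in freq_table.items()}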

Of course, you can use max normalization instead of log normalization.
$0 < \frac{f_{t,d}}{\max_{t' \in d} f_{t',d}} \le 1$

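# ratio max normalize the occurrences to a 0-10 score
# (avoid naming a variable max, which would shadow the built-in function)
max_count = max(freq_table.values())
scores = {word: 10 * count / max_count for word, count in freq_table.items()}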
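
To see how the two normalizations differ, here is a small self-contained sketch that uses the counts from the question. Dividing by the log of the largest count is an assumption added here so that the log score stays within 0 to 10; the names `log_scores` and `max_scores` are illustrative only.

```python
import math

# Counts taken from the table in the question.
freq_table = {"and": 200, "png": 2, "where": 50, "news": 120}

max_count = max(freq_table.values())

# Log normalization, rescaled so the most frequent word scores 10.
log_scores = {
    word: 10 * math.log(1 + count) / math.log(1 + max_count)
    for word, count in freq_table.items()
}

# Max normalization: each count as a fraction of the largest count.
max_scores = {word: 10 * count / max_count for word, count in freq_table.items()}

for word in freq_table:
    print(f"{word:6} log={log_scores[word]:5.2f}  max={max_scores[word]:5.2f}")
```

The log version compresses the gap between frequent and rare words, while the max version keeps the scores proportional to the raw counts.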

Or, if you need a threshold effect, you can use a sigmoid function, which you can customize: $\frac{1}{1 + {\rm e}^{- x}}$
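
One possible way to customize the sigmoid is to shift and stretch its input before scaling the result to 0-10. The sketch below assumes two hypothetical parameters, `midpoint` (the count that should score 5) and `steepness` (how sharp the threshold is); both values are arbitrary choices for illustration and would need tuning for real data.

```python
import math


def sigmoid_score(count, midpoint=50.0, steepness=0.1):
    """Score a count on a 0-10 scale with a logistic curve (hypothetical helper).

    `midpoint` is the count that scores 5; `steepness` controls how sharply
    the score moves from near 0 to near 10 around the midpoint.
    """
    return 10 / (1 + math.exp(-steepness * (count - midpoint)))


# Counts taken from the table in the question.
counts = {"and": 200, "png": 2, "where": 50, "news": 120}
scores = {word: sigmoid_score(n) for word, n in counts.items()}
```

Counts well below the midpoint end up close to 0 and counts well above it close to 10, which gives the threshold effect.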

For more text processing, have a look at the Natural Language Toolkit. For a good term frequency count, stemming is a good choice (removing stop words is useful too)!

deep-learning