Difference between Python's collections.Counter and nltk.probability.FreqDist

Question

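I want to compute term frequencies for the words in a text corpus. For some time I have been doing this with NLTK's word_tokenize followed by probability.FreqDist. word_tokenize returns a list, which FreqDist converts into a frequency distribution. However, I recently came across the Counter class in collections (collections.Counter), which seems to do exactly the same thing. Both FreqDist and Counter have a most_common(n) method that returns the n most common words. Does anyone know whether there is a difference between the two? Is one faster than the other? Are there cases where one works and the other does not?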

Answer 1

nltk.probability.FreqDist is a subclass of collections.Counter.

From the docs:

A frequency distribution for the outcomes of an experiment. A frequency distribution records the number of times each outcome of an experiment has occurred. For example, a frequency distribution could be used to record the frequency of each word type in a document. Formally, a frequency distribution can be defined as a function mapping from each sample to the number of times that sample occurred as an outcome.

The inheritance is shown explicitly in the code, and essentially there is no difference in how a Counter and a FreqDist are initialized; see https://github.com/nltk/nltk/blob/develop/nltk/probability.py#L106

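So speed-wise, creating a Counter and creating a FreqDist should be the same. Any difference in speed should be insignificant, but it is worth noting that the overhead could come from: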

the compilation of the class when it is defined in the interpreter
the cost of duck-typing .__init__()
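Measuring is more reliable than reasoning about overhead, so here is a rough sketch of how the construction cost could be compared. It uses only the standard library: `SubCounter` is a hypothetical stand-in for a Counter subclass with a small custom `__init__`, not NLTK's actual FreqDist, and the token list is made up. To compare the real classes, swap `SubCounter` for `nltk.FreqDist`.

```python
import timeit
from collections import Counter


class SubCounter(Counter):
    """Hypothetical stand-in for a Counter subclass; not NLTK's FreqDist."""

    def __init__(self, samples=None):
        # Delegate to Counter, then do a little extra bookkeeping,
        # which is the kind of per-instance overhead a subclass adds.
        super().__init__(samples)
        self._total = None  # placeholder for a lazily computed cache


# Made-up input: a short sentence repeated to get a non-trivial list.
tokens = "the quick brown fox jumps over the lazy dog the end".split() * 1000

# Both classes build identical counts from the same token list.
assert dict(Counter(tokens)) == dict(SubCounter(tokens))

# Time construction only; absolute numbers depend on the machine.
base_time = timeit.timeit(lambda: Counter(tokens), number=50)
sub_time = timeit.timeit(lambda: SubCounter(tokens), number=50)
print(f"Counter:    {base_time:.4f}s")
print(f"SubCounter: {sub_time:.4f}s")
```

The numbers vary from run to run, so treat a single measurement as indicative only.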

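The main difference is the set of functions that FreqDist provides for statistical / probabilistic natural language processing (NLP), e.g. finding hapaxes. The full list of attributes with which FreqDist extends Counter can be obtained as follows: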

>>> from collections import Counter
>>> from nltk import FreqDist
>>> x = FreqDist()
>>> y = Counter()
>>> set(dir(x)).difference(set(dir(y)))
set(['plot', 'hapaxes', '_cumulative_frequencies', 'r_Nr', 'pprint', 'N', 'unicode_repr', 'B', 'tabulate', 'pformat', 'max', 'Nr', 'freq', '__unicode__'])
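To give a feel for what a few of these extra methods do, here is a minimal sketch of a Counter subclass that mimics `N`, `B`, `freq` and `hapaxes`. The class `TinyFreqDist` is hypothetical and simplified; it reflects my understanding of what those methods return and is not NLTK's implementation.

```python
from collections import Counter


class TinyFreqDist(Counter):
    """Illustrative Counter subclass; a simplified sketch, not NLTK's code."""

    def N(self):
        # Total number of sample outcomes (sum of all counts).
        return sum(self.values())

    def B(self):
        # Number of distinct samples (bins) seen so far.
        return len(self)

    def freq(self, sample):
        # Relative frequency of one sample; 0 for an empty distribution.
        total = self.N()
        return self[sample] / total if total else 0

    def hapaxes(self):
        # Samples that occur exactly once.
        return [sample for sample, count in self.items() if count == 1]


# str.split() stands in for a real tokenizer such as word_tokenize.
words = "to be or not to be that is the question".split()
fd = TinyFreqDist(words)

print(fd.N())                # 10
print(fd.B())                # 8
print(fd.freq("to"))         # 0.2
print(sorted(fd.hapaxes()))  # ['is', 'not', 'or', 'question', 'that', 'the']
```

Because the class inherits from Counter, everything Counter offers (indexing, `update`, `most_common`, arithmetic) keeps working on `fd` unchanged.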

When it comes to FreqDist.most_common(), it is actually using the parent method from Counter, so the speed of retrieving the sorted most_common list is the same for both types.
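The inheritance of `most_common` can be illustrated without NLTK installed. In the sketch below, `SubCounter` is a hypothetical empty subclass standing in for FreqDist, and `str.split()` replaces a real tokenizer; the same two checks could be run against `nltk.FreqDist` itself.

```python
from collections import Counter


class SubCounter(Counter):
    """Hypothetical subclass that adds nothing; stands in for FreqDist."""


# str.split() is used here in place of a real tokenizer.
tokens = "a rose is a rose is a rose".split()

plain = Counter(tokens)
sub = SubCounter(tokens)

# The subclass does not define most_common itself ...
print("most_common" in vars(SubCounter))              # False
# ... so attribute lookup falls through to Counter's method.
print(SubCounter.most_common is Counter.most_common)  # True

print(plain.most_common(2))                           # [('a', 3), ('rose', 3)]
print(sub.most_common(2) == plain.most_common(2))     # True
```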

Personally, when I just want to retrieve counts, I use collections.Counter. But when I need to do some statistical manipulation, I either use nltk.FreqDist or dump the Counter into a pandas.DataFrame.

python

nlp

nltk