Count the number of occurrences of words in a list of lists

Question

I have a dataset with roughly 5000 distinct words and 5000 rows.

Example with 2 rows:

data = [["I", "am", "John"], ["Where", "is", "John", "?"]]

What I want to do is count how many times each word occurs:

result = {"I": 1, "am": 1, "John": 2, "Where":1, ...}

But I don't know how to do this efficiently.

Any suggestions?

Answer 1

You can use a list comprehension like this:

from collections import Counter
Counter([word for sentence in data for word in sentence])
# Or pass a generator expression, which avoids building
# an intermediate list containing every word:
Counter(word for sentence in data for word in sentence)
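
As a quick check, here is a minimal sketch that applies the generator form to the two sample rows from the question. The variable name `counts` is only illustrative; `most_common` is a standard `Counter` method.

```python
from collections import Counter

data = [["I", "am", "John"], ["Where", "is", "John", "?"]]

# Flatten the nested lists lazily and count every token.
counts = Counter(word for sentence in data for word in sentence)

print(counts["John"])         # 2
print(counts["missing"])      # 0: a Counter returns 0 for unseen keys
print(counts.most_common(1))  # [('John', 2)]

# Convert to a plain dict if you need the exact shape shown in the question.
as_dict = dict(counts)
```

Since `Counter` is a `dict` subclass, the conversion on the last line is usually optional.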

Answer 2

The good news is that the Python standard library has plenty of handy tools for this.

import itertools
from collections import Counter

data = [["I", "am", "John"], ["Where", "is", "John", "?"]]
result = Counter(itertools.chain(*data))
# result: Counter({'John': 2, 'I': 1, 'am': 1, 'Where': 1, 'is': 1, '?': 1})

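The asterisk (*data) is syntax for unpacking an iterable into separate positional arguments. It is hard to explain in words, so let's look at an example: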

data = [1, 2, 3, 4, 5]
print(*data)
print(data[0], data[1], data[2], data[3], data[4])

The second and third lines are equivalent: both print 1 2 3 4 5.
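
Applied to the nested list from the question, unpacking means that `itertools.chain(*data)` receives each inner list as its own argument. The sketch below illustrates this, and also shows `itertools.chain.from_iterable`, a standard-library alternative that takes the outer list directly and so avoids unpacking a long list into arguments. The variable names are only illustrative.

```python
import itertools
from collections import Counter

data = [["I", "am", "John"], ["Where", "is", "John", "?"]]

# chain(*data) is the same call as chain(data[0], data[1]).
unpacked = list(itertools.chain(*data))
explicit = list(itertools.chain(data[0], data[1]))
print(unpacked == explicit)   # True

# chain.from_iterable takes the outer list itself, so no unpacking is needed.
flattened = list(itertools.chain.from_iterable(data))
print(flattened == unpacked)  # True

result = Counter(itertools.chain.from_iterable(data))
print(result["John"])         # 2
```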

Answer 3

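I'll give you a high-level algorithm. If you need actual code, let me know.

Create a dictionary named counts.
Iterate over data.
For each element of data (a list of words), iterate over its strings.
For each string, check whether it is already a key in counts. If it is, increment its count. Otherwise, set counts[word] = 1.
At the end, counts will hold what you want.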

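This takes O(n) time, where n is the total number of words, because you visit each word only once and dictionary lookups take O(1) time on average. No approach can do asymptotically better, since every word has to be read at least once.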
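
The steps above can be sketched in Python roughly as follows. The function name `count_words` is hypothetical and chosen only for this example; the logic follows the steps as written, using a plain dict.

```python
def count_words(rows):
    """Count occurrences of each word in a list of lists of strings."""
    counts = {}
    for row in rows:            # each row is a list of words
        for word in row:        # visit every word exactly once
            if word in counts:
                counts[word] += 1
            else:
                counts[word] = 1
    return counts


data = [["I", "am", "John"], ["Where", "is", "John", "?"]]
result = count_words(data)
print(result["John"])  # 2
print(result["I"])     # 1
```

The if/else can also be collapsed into `counts[word] = counts.get(word, 0) + 1`, which behaves the same way.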

python

nltk