Python loop has extremely long runtime

Question

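I'm trying to build a vocabulary from a set of strings and then remove every word that does not appear in at least 30 of those strings. There are about 300,000 words in the set in total. For some reason, the code that checks whether a word is repeated 30 times takes more than 5 minutes to run. How can I make this code more efficient so that it has a reasonable runtime? Thanks!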

word_list = []
for item in ex_set:
    word_list += list(dict.fromkeys(item.split()))  # drop duplicate words within each string

vocab_list = []
for word in word_list: #where it runs forever
    if word_list.count(word) >= 30:
        vocab_list.append(word)

Answer 1

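If you want all the words that appear at least 30 times in the word list, you can count them first with collections.Counter and then pick out every word whose count is at least 30.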

from collections import Counter
# Count occurrences in word_list (the per-string de-duplicated list from the question).
word_counts = Counter(word_list)

vocab_list = [word for word, count in word_counts.items() if count >= 30]

Also note: don't use the word set as a variable name, since that would shadow Python's built-in set type.
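
A minimal end-to-end sketch of this approach, assuming ex_set is an iterable of whitespace-separated strings as in the question. The helper name build_vocab, the min_count parameter, and the sample data are illustrative and not from the original post. Each string is reduced to its distinct words before counting, so a word is counted at most once per string and no intermediate word_list is needed.

```python
from collections import Counter

def build_vocab(strings, min_count=30):
    """Return the words that occur in at least `min_count` of the strings."""
    doc_counts = Counter()
    for text in strings:
        # set() drops repeats within one string, so each string
        # contributes at most 1 to a word's count.
        doc_counts.update(set(text.split()))
    return [word for word, count in doc_counts.items() if count >= min_count]

# Hypothetical sample data standing in for ex_set.
sample = ["a b a", "a c", "b a"]
vocab = build_vocab(sample, min_count=2)
print(sorted(vocab))  # ['a', 'b']
```

The whole computation is a single pass over the input, so the runtime grows linearly with the total number of words.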

Answer 2

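Here's another way to think about the problem:

Every call to count walks the entire list again, so calling it once per word takes quadratic time overall.

If you build a dict of word counts instead, the second loop only has to iterate over a much smaller data structure: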

from collections import defaultdict

counter_dict = defaultdict(int)
for word in word_list:
    counter_dict[word] += 1

vocab_list = []
for word, count in counter_dict.items():
    if count >= 30:
        vocab_list.append(word)
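
To make the difference concrete, the sketch below runs both approaches on a small synthetic word list and checks that they agree. The data and the helper names are made up for illustration. One behavioural difference worth noting: the loop in the question appends a word once per occurrence, so its result contains duplicates, whereas the dict version lists each qualifying word once.

```python
from collections import defaultdict

def vocab_quadratic(word_list, min_count=30):
    # list.count rescans the whole list for every element: O(n^2) overall.
    return [w for w in word_list if word_list.count(w) >= min_count]

def vocab_linear(word_list, min_count=30):
    # One pass to count, then one pass over the distinct words: O(n) overall.
    counts = defaultdict(int)
    for w in word_list:
        counts[w] += 1
    return [w for w, c in counts.items() if c >= min_count]

# Synthetic data: "common" appears 40 times, "rare" only 5 times.
words = ["common"] * 40 + ["rare"] * 5
slow = vocab_quadratic(words)
fast = vocab_linear(words)

print(len(slow))  # 40: one entry per occurrence of "common"
print(fast)       # ['common']
```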

Having now seen Jmonsky's answer: if it works for you, that is the one that should be accepted.
