Counter() is returning 1 for all words. How to get actual count?

Question

I have a text file from which I'm trying to get the most frequently used words. I'm using Counter, but it seems to return 1 for every word.

I'm still learning, so I'm using the Simple Sabotage Field Manual as my text file.

import re
from collections import Counter
my_file = "fieldManual.txt"

#### GLOBAL VARIABLES
lst = [] # used in unique_words
cnt = Counter()

#########

def clean_word(the_word):
    #new_word = re.sub('[^a-zA-Z]', '',the_word)
    new_word = re.sub('^[^a-zA-z]*|[^a-zA-Z]*$', '', the_word)
    return new_word

def unique_words():
    with open(my_file, encoding="utf8") as infile:
        for line in infile:
            words = line.split()
            for word in words:
                edited_word = clean_word(word)
                if edited_word not in lst:
                    lst.append(edited_word)
                    cnt[edited_word] += 1
    lst.sort()  
    word_count = Counter(lst)
    return(lst)
    return (cnt)

unique_words()
test = ['apple','egg','apple','banana','egg','apple']
print(Counter(lst)) # returns '1' for everything
print(cnt) # same here

So print(Counter(test)) correctly returns:

Counter({'apple': 3, 'egg': 2, 'banana': 1})

But my attempt to print the most frequent words in my lst returns:

Counter({'': 1, 'A': 1, 'ACTUAL': 1, 'AGREE': 1, 'AGREEMENT': 1, 'AK': 1, 'AND': 1, 'ANY': 1, 'ANYTHING': 1, 'AR': 1, 'AS-IS': 1, 'ASCII': 1, 'About': 1, 'Abstract': 1, 'Accidentally': 1, 'Act': 1, 'Acts': 1, 'Add': 1, 'Additional': 1, 'Adjust': 1, 'Advocate': 1, 'After': 1, 'Agriculture': 1, ...

Following the answer from here, I tried adding cnt.update(edited_word) inside the if edited_word not in lst: block, but then printing cnt gives me only single characters:

Counter({'e': 2401, 'i': 1634, 't': 1470, 's': 1467, 'n': 1455, 'r': 1442, 'a': 1407, 'o': 1244, 'l': 948, 'c': 862, 'd': 752, 'u': 651, 'p': 590, 'g': 564, 'm': 436, ...

How do I return the frequency of each unique word in my .txt file?

Answer 1

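You only append a word to lst, and only increment cnt, when the word has not been seen before. Each word is therefore recorded exactly once, so every count is 1.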
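To make that concrete, here is a minimal sketch that replays the question's loop on its own test list instead of the text file. The variable name words is only for illustration; lst and cnt mirror the question.

```python
from collections import Counter

words = ['apple', 'egg', 'apple', 'banana', 'egg', 'apple']

# Same logic as the question: append and increment only on first sight.
lst = []
cnt = Counter()
for word in words:
    if word not in lst:
        lst.append(word)
        cnt[word] += 1

print(Counter(lst))    # every word appears once in lst, so every count is 1
print(cnt)             # incremented only inside the "not in lst" branch, so also 1

# Counting the full list of occurrences gives the real frequencies.
print(Counter(words))  # Counter({'apple': 3, 'egg': 2, 'banana': 1})
```

The guard is what removes the duplicates, and the duplicates are exactly what Counter needs in order to count.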

Answer 2

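There are a few mistakes here. You should either increment the counter whether or not the word is already in the list, or simply call Counter on the list produced by splitting the string. You have back-to-back return statements (the second one will never execute). You compute word_count from the list and then ignore the result (which would also be 1 for every word). Just cleaning up this code will probably help solve the problem.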
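One possible cleanup along those lines is sketched below. It is an illustration rather than the only fix. The function name count_words and the sample lines are made up for the example, and the function takes any iterable of lines so that an open file object can be passed in directly. The regex is the question's, with the character class written as a-zA-Z in both places.

```python
import re
from collections import Counter

def clean_word(the_word):
    # Strip leading and trailing non-letters, as in the question.
    return re.sub(r'^[^a-zA-Z]*|[^a-zA-Z]*$', '', the_word)

def count_words(lines):
    """Count every occurrence of each cleaned word in an iterable of lines."""
    cnt = Counter()
    for line in lines:
        for word in line.split():
            edited_word = clean_word(word)
            if edited_word:            # skip tokens that were only punctuation
                cnt[edited_word] += 1  # increment on every occurrence
    return cnt                         # a single return value

# Hypothetical sample input standing in for the text file.
sample = ["apple, egg! apple.", "banana egg (apple) --"]
counts = count_words(sample)
print(counts.most_common(1))

# Why cnt.update(edited_word) produced single characters in the question:
# update() iterates its argument, and iterating a string yields characters.
by_char = Counter()
by_char.update("egg")      # counts 'e' and 'g'
by_word = Counter()
by_word.update(["egg"])    # wrap the word in a list to count it as one item
```

With the file from the question, the call would look like count_words(infile) inside the with open(my_file, encoding="utf8") as infile: block.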

Tags: collections, python-3.x, word-frequency