Showing the Word Count for Each Word

Question

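I'm having trouble getting the top 15 word counts (the count for each word) on Google Colab for the document Wuthering Heights (https://www.gutenberg.org/files/768/768.txt). The count should only include the words that start after "ccx074@pglaf.org" and end before "END OF THE PROJECT GUTENBERG EBOOK WUTHERING HEIGHTS." Here is the code I tried.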

file = open('768.txt','r+')
wordcount = {}
for word in file.read().split():
    if word not in wordcount:
        wordcount[word] = 1
    else:
        wordcount[word] +=1
for k,v in wordcount.items():
    print(k,v)

Answer 1

You can use a regular expression to find the substring you need:

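import re

start = 'ccx074@pglaf.org'
end = 'END OF THE PROJECT GUTENBERG EBOOK WUTHERING HEIGHTS'

# read the whole file; the with block closes it afterwards
with open('768.txt', 'r') as file:
    text = file.read()

# re.escape keeps the '.' in the markers literal; re.S lets '.' match newlines
m = re.findall(re.escape(start) + '(.*?)' + re.escape(end), text, flags=re.S)[0]

wordcount = {}
for word in m.split():
    if word not in wordcount:
        wordcount[word] = 1
    else:
        wordcount[word] += 1
for k, v in wordcount.items():
    print(k, v)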

Sample output:

WUTHERING 1
HEIGHTS 1
CHAPTER 34
I 3215
1801.--I 1
have 594
just 72
returned 39
from 476
...
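
To see what the pattern does in isolation, here is a minimal sketch that runs on a small in-memory string rather than the real file. The sample text, the short markers, and the helper name `between` are made up for illustration; `re.escape` is used so that any `.` in a marker is matched literally.

```python
import re

def between(text, start, end):
    """Return the text between the first `start` marker and the next `end` marker."""
    # re.escape makes regex metacharacters in the markers (such as '.') literal.
    pattern = re.escape(start) + r'(.*?)' + re.escape(end)
    match = re.search(pattern, text, flags=re.S)  # re.S: '.' also matches newlines
    return match.group(1) if match else ''

# Hypothetical stand-in for the Gutenberg file contents.
sample = "header\nemail a.b@x.org\nfirst line\nsecond line\nEND OF BOOK\nlicense"
body = between(sample, 'a.b@x.org', 'END OF BOOK')
print(body.split())
```

Returning an empty string when a marker is missing is one possible design choice; indexing `re.findall(...)[0]` would raise an `IndexError` in that case instead.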

However, you can also let the standard library do the counting. For example:

from collections import Counter
print(Counter(m.split()))

#Counter({'the': 4273, 'and': 4189, 'to': 3436, ...})

Edit: to get the counts sorted (lowest first):

sorted(Counter(m.split()).items(), key=lambda x:x[1])

Or reversed, from highest to lowest:

sorted(Counter(m.split()).items(), key=lambda x:x[1], reverse=True)
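
Since the question asks only for the top entries, `Counter.most_common` may be the shortest route. The following is a small sketch on a made-up sentence; the name `sample` is a stand-in for the extracted text `m`, and for the real novel you would presumably call `most_common(15)`.

```python
from collections import Counter

# Hypothetical sample text standing in for the extracted novel body `m`.
sample = "the cat and the dog and the bird"
counts = Counter(sample.split())

# most_common(n) returns the n highest counts as (word, count) pairs, highest first.
top = counts.most_common(2)
for word, count in top:
    print(word, count)
```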

Answer 2

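With the help of string.punctuation and operator.itemgetter, this could be one approach. It will get you close. Note that removing punctuation strips the sentence endings (.!?) so that you get clean words. (It also removes apostrophes, which you may not want to remove.)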

from collections import Counter
from string import punctuation
from operator import itemgetter

d = Counter()

with open('wuthering_heights.txt', 'r') as f:
    opening = False

    for line in f:
        if line.startswith('ccx074@pglaf.org'):
            opening = True
        if not opening:
            continue
        if line.startswith('CHAPTER'): # don't count chapter headings
            continue
        if line.startswith('***END OF THE PROJECT GUTENBERG EBOOK'):
            break
        
        line = line.strip()
        if len(line) == 0:
            continue
        
        # clean out punctuation
        line = line.translate(str.maketrans('','',punctuation))
        
        d.update(line.lower().split())

print('different words count', len(d))
#print(d.most_common(15))

for word, count in reversed(sorted(d.items(), key=itemgetter(1))):
    print(word, count)
    if count < 290:
        break

This prints:

different words count 10098
and 4693
the 4552
i 3530
to 3476
a 2301
of 2221
he 1922
you 1712
her 1544
in 1459
his 1419
it 1284
she 1269
that 1188
was 1124
my 1098
me 1047
not 932
as 931
him 917
for 836
on 809
with 804
at 783
be 724
had 687
but 673
is 649
have 629
from 485
by 451
would 442
if 440
heathcliff 413
your 404
no 384
said 368
so 357
were 354
linton 340
catherine 333
an 317
we 311
mr 309
or 307
when 307
out 305
what 301
are 295
this 290
they 283
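
The note about apostrophes can be illustrated with a short sketch. The helper `clean_words` and its `keep` parameter are hypothetical names introduced here, not part of the answer's code; the idea is simply to leave chosen characters out of the set passed to `str.maketrans`. Keep in mind that `string.punctuation` covers ASCII punctuation only, so typographic quotes in a text would not be touched either way.

```python
from string import punctuation

def clean_words(line, keep="'"):
    """Lowercase a line, drop punctuation except the characters in `keep`, and split into words."""
    # Build the set of characters to delete, leaving out anything we want to keep.
    drop = ''.join(ch for ch in punctuation if ch not in keep)
    return line.translate(str.maketrans('', '', drop)).lower().split()

line = "Don't go, Heathcliff!"
print(clean_words(line, keep=''))  # strips every ASCII punctuation character
print(clean_words(line))           # keeps apostrophes, so contractions stay intact
```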

python

computer-science

google-colaboratory