Not counting characters correctly in a text file

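I'm working on another program that uses text file I/O, and I'm confused because my code looks perfectly reasonable but the result seems crazy. I want to count the number of words, characters, sentences, and unique words in text files of political speeches. Here is my code, which may make things clearer.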

#This program will serve to analyze text files for the number of words in
#the text file, number of characters, sentences, unique words, and the longest
#word in the text file. This program will also provide the frequency of unique
#words. In particular, the text will be three political speeches which we will
#analyze, building on searching techniques in Python.
#CISC 101, Queen's University
#By Damian Connors; 10138187

def main():
    harper = readFile("Harper's Speech.txt")
    print(numCharacters(harper), "Characters.")
    obama1 = readFile("Obama's 2009 Speech.txt")
    print(numCharacters(obama1), "Characters.")
    obama2 = readFile("Obama's 2008 Speech.txt")
    print(numCharacters(obama1), "Characters.")

def readFile(filename):
    '''Function that reads a text file, then prints the name of file without
'.txt'. The function returns the read file for main() to call, and prints
the file's name so the user knows which file is read'''
    inFile1 = open(filename, "r")
    fileContentsList = inFile1.readlines()
    inFile1.close()
    print(filename.replace(".txt", "") + ":")  #this prints filename
    return fileContentsList

def numCharacters(file):
    return len(file) - file.count(" ")

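The problem I'm having right now is counting the characters. It keeps reporting 85, but this is a fairly large file and I know it should be 7792 characters. Any idea what I'm doing wrong? Here is my shell output; I'm using Python 3.3.3.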

>>> ================================ RESTART ================================
>>> 
Harper's Speech:
85 Characters.
Obama's 2009 Speech:
67 Characters.
Obama's 2008 Speech:
67 Characters.
>>> 

As you can see, I have 3 speech files, and there is no way they contain that few characters.

You should change this line: fileContentsList = inFile1.readlines(). Right now you are counting how many lines Obama's speech has. Change readlines() to read() and it will work.
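A minimal sketch of that change, assuming the goal is still "every character except spaces". The names read_text and num_characters are illustrative and not from the original program:

```python
def read_text(filename):
    """Return the whole file as a single string (read() instead of readlines())."""
    with open(filename, "r") as in_file:
        return in_file.read()


def num_characters(text):
    """Count the characters in a string, excluding spaces.

    Newlines and tabs are still counted, just as in the original formula.
    """
    return len(text) - text.count(" ")
```

With this, something like num_characters(read_text("Harper's Speech.txt")) operates on a string, so len() measures characters rather than lines.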

The readlines function returns a list containing the lines, so its length is the number of lines in the file, not the number of characters.
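The difference is easy to see on a small in-memory sample. This sketch uses io.StringIO as a stand-in for an open file, and the sample text is made up for illustration:

```python
import io

sample = "Four score and\nseven years ago\n"

# readlines() yields one list element per line, so len() counts lines.
lines = io.StringIO(sample).readlines()
line_count = len(lines)

# read() yields a single string, so len() counts characters.
text = io.StringIO(sample).read()
char_count = len(text)

print(line_count)  # 2
print(char_count)  # 31
```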

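You either have to find a way to read in all the characters (so that the length is right), for example with read(),

or go through each line and tally the characters in it, perhaps something like: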

def numCharacters(file):
    tot = 0
    for line in file:    # file is the list of lines from readlines()
        tot = tot + len(line) - line.count(" ")
    return tot

(assuming, of course, that your chosen method of counting characters is actually correct).
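On that caveat: len(line) - line.count(" ") removes only spaces, so newlines and tabs are still counted. If "characters" is meant to exclude all whitespace, one possible sketch is the following. The function name is made up, and whether punctuation should count is a separate decision that this does not address:

```python
def count_non_whitespace(text):
    """Count every character that is not whitespace (space, tab, newline, ...)."""
    return sum(1 for ch in text if not ch.isspace())


sample = "a b\tc\n"
print(len(sample) - sample.count(" "))  # 5: the tab and newline are still counted
print(count_non_whitespace(sample))     # 3: only a, b and c
```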


As an aside, your third output statement references obama1 rather than obama2; you will probably want to fix that as well.
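One way to make that kind of copy-and-paste slip less likely is to loop over the filenames instead of repeating the statements. This is only a sketch, assuming read() is used and spaces are excluded; count_characters and the filenames parameter are illustrative names:

```python
def count_characters(filename):
    """Read the whole file and count its characters, excluding spaces."""
    with open(filename, "r") as in_file:
        text = in_file.read()
    return len(text) - text.count(" ")


def main(filenames):
    """Print the label and character count for each file in turn."""
    for filename in filenames:
        print(filename.replace(".txt", "") + ":")
        print(count_characters(filename), "Characters.")
```

It would be called as main(["Harper's Speech.txt", "Obama's 2009 Speech.txt", "Obama's 2008 Speech.txt"]), so each file is read and reported by the same two lines.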

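You are counting lines. In more detail, you are effectively reading the file into a list of lines and then counting them. Below is a cleaned-up version of what your code does.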

def count_lines(filename):
    with open(filename) as stream:
        return len(stream.readlines())

The simplest change to this kind of code for counting words is to read the whole file, split it into words, and count those, as in the following code.

def count_words(filename):
    with open(filename) as stream:
        return len(stream.read().split())

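Notes:

  • The code may need to be updated to match your exact definition of a word.
  • This approach is not suitable for very large files, because it reads the whole file into memory and the list of words is stored there too.

So the code above is more of a conceptual model than the best final solution.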
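For larger files, a line-by-line variant keeps only one line in memory at a time. This is a sketch under the same whitespace-separated definition of a word as count_words; the function name is made up:

```python
def count_words_streaming(filename):
    """Count whitespace-separated words without loading the whole file at once."""
    total = 0
    with open(filename) as stream:
        for line in stream:              # the file object yields one line at a time
            total += len(line.split())   # split() with no argument splits on any whitespace
    return total
```

Because split() never joins text across a line break, this gives the same total as splitting the whole file in one go.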

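What you are currently seeing is the number of lines in the file. Since readlines() returns a list, fileContentsList is a list of lines, and numCharacters returns the size of that list.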

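If you want to keep using readlines, you need to count the characters in each line and then add them up to get the total number of characters in the file.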

def main():
    print(readFile("Harper's Speech.txt"), "Characters.")
    print(readFile("Obama's 2009 Speech.txt"), "Characters.")
    print(readFile("Obama's 2008 Speech.txt"), "Characters.")

def readFile(filename):
    '''Function that reads a text file and prints the name of the file without
'.txt', so the user knows which file was read. The function returns the
number of characters in the file, not counting spaces or line endings.'''
    inFile1 = open(filename, "r")
    fileContentsList = inFile1.readlines()
    inFile1.close()
    totalChar = 0    # Variable to store total number of characters
    for line in fileContentsList:    # reading all lines
        line = line.rstrip("\n")    # removing line end character '\n' from lines
        totalChar = totalChar + len(line) - line.count(" ")    # adding number of characters in line to total characters,
                                                               # excluding the spaces in the current line
    print(filename.replace(".txt", "") + ":")  #this prints filename
    return totalChar

main() # calling main function.