如何比较两个文本文件中的词频?

How to compare word frequencies from two text files?

如何比较python中两个文本文件的词频?例如,如果一个词同时包含在 file1 和 file2 中,那么它应该只写一次,而不是在比较时添加它们的频率,它应该是 {'The': 3,5}。这里 3 是文件 1 中的频率,5 是文件 2 中的频率。如果某些单词只存在于一个文件中而不存在于两个文件中,那么该文件应该为 0。请帮助 这是我到目前为止所做的:

import operator
f1=open('file1.txt','r') #file 1
f2=open('file2.txt','r') #file 2

wordlist=[]
wordlist2=[]
for line in f1:
    for word in line.split():
        wordlist.append(word)

for line in f2:
    for word in line.split():
        wordlist2.append(word)

worddictionary = {}
for word in wordlist:
    if word in worddictionary:
        worddictionary[word] += 1
    else:
        worddictionary[word] = 1

worddictionary2 = {}
for word in wordlist2:
    if word in worddictionary2:
        worddictionary2[word] += 1
    else:
        worddictionary2[word] = 1

print(worddictionary)
print(worddictionary2)

编辑:这是对任何文件列表执行此操作的更通用的方法(注释中的解释):

f1=open('file1.txt','r') #file 1
f2=open('file2.txt','r') #file 2

file_list = [f1, f2] # This would hold all your open files
num_files = len(file_list)

frequencies = {} # We'll just make one dictionary to hold the frequencies

for i, f in enumerate(file_list): # Loop over the files, keeping an index i
    for line in f: # Get the lines of that file
        for word in line.split(): # Get the words of that file
            if not word in frequencies:
                frequencies[word] = [0 for _ in range(num_files)] # make a list of 0's for any word you haven't seen yet -- one 0 for each file

            frequencies[word][i] += 1 # Increment the frequency count for that word and file

print frequencies

根据您编写的代码,以下是创建组合字典的方法:

import operator
f1=open('file1.txt','r') #file 1
f2=open('file2.txt','r') #file 2

wordlist=[]
wordlist2=[]
for line in f1:
    for word in line.split():
        wordlist.append(word)

for line in f2:
    for word in line.split():
        wordlist2.append(word)

worddictionary = {}
for word in wordlist:
    if word in worddictionary:
        worddictionary[word] += 1
    else:
        worddictionary[word] = 1

worddictionary2 = {}
for word in wordlist2:
    if word in worddictionary2:
        worddictionary2[word] += 1
    else:
        worddictionary2[word] = 1

# Create a combined dictionary
combined_dictionary = {}
all_word_set = set(worddictionary.keys()) | set(worddictionary2.keys())
for word in all_word_set:
    combined_dictionary[word] = [0,0]
    if word in worddictionary:
        combined_dictionary[word][0] = worddictionary[word]
    if word in worddictionary2:
        combined_dictionary[word][1] = worddictionary2[word]


print(worddictionary)
print(worddictionary2)
print(combined_dictionary)

编辑:我误解了问题,代码现在可以解决你的问题。

f1 = open('file1.txt','r') #file 1
f2 = open('file2.txt','r') #file 2

wordList = {}

for line in f1.readlines(): #for each line in lines (file.readlines() returns a list)
    for word in line.split(): #for each word in each line
        if(not word in wordList): #if the word is not already in our dictionary
            wordList[word] = 0 #Add the word to the dictionary

for line in f2.readlines(): #for each line in lines (file.readlines() returns a list)
    for word in line.split(): #for each word in each line
        if(word in wordList): #if the word is already in our dictionary
            wordList[word] = wordList[word]+1 #add one to it's value

f1.close() #close files
f2.close()

f1 = open('file1.txt','r') #Have to re-open because we are at the end of the file.
#might be a n easier way of doing this

for line in f1.readlines(): #Removing keys whose values are 0
    for word in line.split(): #for each word in each line
        try:
            if(wordList[word] == 0): #if it's value is 0
                del wordList[word] #remove it from the dictionary
            else:
                wordList[word] = wordList[word]+1 #if it's value is not 0, add one to it for each occurrence in file1
        except:
            pass #we know the error was that there was no wordList[word]
f1.close()

print(wordList)

添加第一个文件单词,如果该单词在第二个文件中,则将值加一。 之后,检查每个单词,如果它的值为0,则将其删除。

这不能通过迭代字典来完成,因为它在迭代时会改变大小。

这是为多个文件(更复杂)实现它的方式:

f1 = open('file1.txt','r') #file 1
f2 = open('file2.txt','r') #file 2

fileList = ["file1.txt", "file2.txt"]
openList = []
for i in range(len(fileList)):
    openList.append(open(fileList[i], 'r'))

fileWords = []

for i, file in enumerate(openList): #for each file
    fileWords.append({}) #add a dictionary to our list
    for line in file: #for each line in each file
        for word in line.split(): #for each word in each line
            if(word in fileWords[i]): #if the word is already in our dictionary
                fileWords[i][word] += 1 #add one to it
            else:
                fileWords[i][word] = 1 #add it to our dictionary with value 0

for i in openList:
    i.close()

for i, wL in enumerate(fileWords):
    print(f"File: {fileList[i]}")
    for l in wL.items():
        print(l)
    #print(f"File {i}\n{wL}")

您可能会发现以下演示程序是获取文件词频的良好起点:

#! /usr/bin/env python3
import collections
import pathlib
import pprint
import re
import sys


def main():
    freq = get_freq(sys.argv[0])
    pprint.pprint(freq)


def get_freq(path):
    if isinstance(path, str):
        path = pathlib.Path(path)
    return collections.Counter(
        match.group() for match in re.finditer(r'\b\w+\b', path.open().read())
    )


if __name__ == '__main__':
    main()

特别是,您需要使用 get_freq 函数来获取一个 Counter 对象,告诉您词频是什么。您的程序可以使用不同的文件名多次调用 get_freq 函数,您应该会发现 Counter 对象与您之前使用的字典非常相似。