Count duplicate files

Question

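I'm new to Python and want to count identical contents across 60k text files, listing every distinct content together with the number of files that share it: like uniq -c, but at the file level rather than the line level.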

So far I have:

from os import listdir
from os.path import isfile, join

mypath = "C:\Users\daniel.schneider\Downloads\Support"  # my Filepath
onlyfiles = [ f for f in listdir(mypath) if isfile(join(mypath,f)) ]

for currentFile in onlyfiles:
    currentPath = mypath + '\\' + currentFile
    f = open(currentPath)
    print currentPath
    for currentLine in currentFile:
        print currentLine[24:]      
    f.close()
    break

Answer 1

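I haven't tested this thoroughly, but you could use Python's hashlib to compute an MD5 hash of each file and store the filenames in a dictionary that maps each hash to a list of the files with that hash.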

Then, to get each unique content and the number of files it appears in, iterate over the dictionary:

import os
import hashlib

mypath = 'testdup'
onlyfiles = [f for f in os.listdir(mypath)
             if os.path.isfile(os.path.join(mypath, f))]

# Map each MD5 hex digest to the list of filenames with that content.
files = {}
for filename in onlyfiles:
    # Read in binary mode so the hash reflects the exact bytes on disk.
    with open(os.path.join(mypath, filename), 'rb') as f:
        filehash = hashlib.md5(f.read()).hexdigest()
    files.setdefault(filehash, []).append(filename)

# Print each distinct content once, with the number of files sharing it.
for filehash, filenames in files.items():
    print('{0} files have this content:'.format(len(filenames)))
    with open(os.path.join(mypath, filenames[0])) as f:
        print(f.read())
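
If only the counts are needed, in the spirit of uniq -c, the same hashing idea might be wrapped up roughly as below. This is a minimal sketch, not a tested solution: the helper names file_digest, count_duplicates and print_counts and the 64 KiB chunk size are assumptions chosen for illustration. It reads each file in chunks so a large file does not have to fit in memory, and keeps one example filename per distinct content.

```python
import hashlib
import os
from collections import Counter


def file_digest(path, chunk_size=65536):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            md5.update(chunk)
    return md5.hexdigest()


def count_duplicates(directory):
    """Return (counts, example) for the regular files in directory.

    counts maps digest -> number of files with that content;
    example maps digest -> one filename that has that content.
    """
    counts = Counter()
    example = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            digest = file_digest(path)
            counts[digest] += 1
            example.setdefault(digest, name)
    return counts, example


def print_counts(directory):
    """Print one line per distinct content, most common first."""
    counts, example = count_duplicates(directory)
    for digest, n in counts.most_common():
        print('{0:7d} {1}'.format(n, example[digest]))
```

Calling print_counts('testdup') would print one line per distinct content: the number of files sharing it, followed by the name of one such file. Two files with the same MD5 digest are treated as identical here; if that is a concern, files in the same group can additionally be compared byte by byte.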
