Remove duplicate files by filename in a directory (linux)
I have a directory structure like this:

ARCHIVE_LOC -> epoch1 -> a.txt
                         b.txt
            -> epoch2 -> b.txt
                         c.txt
            -> epoch3 -> b.txt
                         c.txt
I have a base archive directory. This directory receives logs (periodically, via rsync) from an Android application, and the logs are saved into directories named after the epoch/timestamp of the rsync run. I want to delete all duplicate log files (files that have the same name) and keep only the latest one. Any help on how to achieve this?

In short, I only want to keep the latest version of each file. One way to tell which file is the latest is its size, since a newer file is always greater than or equal to the older one in size.
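Before deleting anything, it can help to just list which names are duplicated and how their sizes compare across the epoch directories. A minimal sketch (assuming GNU find and awk; ARCHIVE_LOC stands for the base archive directory from the tree above):

# list every file name that occurs more than once, together with its size and path
find ARCHIVE_LOC -type f -printf '%f\t%s\t%p\n' | sort | awk -F'\t' '
    { count[$1]++; lines[$1] = lines[$1] $0 "\n" }
    END { for (n in count) if (count[n] > 1) printf "%s", lines[n] }
'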
I wrote the following script; it works well for me.
#!/bin/bash
# usage: <script> BASE_DIR [-d]
# check that the base directory provided as the first argument exists
[ -e "$1" ] || {
    printf "\nError: invalid path. \n\n"
    exit 1
}
cd "$1"
option="$2"   # pass -d to actually delete; otherwise duplicates are only printed
# find the files in the base directory, sort the names and keep only the duplicated ones,
# then iterate over the resulting list of file names
# note: no extension filter is applied here; add e.g. -name "*.json" to restrict which log files are considered
for name in `find -type f -printf "%f\n" | sort | uniq -d`;
do
    # count the duplicates of this name so we can keep the last file (the biggest in size)
    numDups=$(find -name $name | wc -l); # number of duplicates found for a given file name
    for file in $(find -name $name | sort -h); # sort the paths so the latest/biggest file comes last
    do
        if [ $numDups -ne 1 ];
        then
            if [ "$option" = -d ] # remove the duplicate file
            then
                rm $file
            else
                echo $file # if -d is not provided, just print the duplicate file names
                # note: this prints only the duplicate files, not the latest/biggest one
            fi
        fi
        numDups=$(($numDups-1))
        # note: as the code stands, the option value is checked for every duplicate file;
        # the if conditions could be moved out of the loop, but that would duplicate code.
        # the script can be reworked if serious performance issues show up.
    done
done;
exit 0;
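Assuming the script is saved as dedup.sh (the name is just an example), a run could look like this; without -d it only prints the older duplicates, with -d it removes them:

./dedup.sh /path/to/ARCHIVE_LOC       # dry run: print the duplicate files that would be removed
./dedup.sh /path/to/ARCHIVE_LOC -d    # actually remove the older duplicates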
#!/bin/bash
declare -A arr        # associative array: md5 checksum -> number of times seen
shopt -s globstar     # make ** match files in all subdirectories
for file in **; do
    [[ -f "$file" ]] || continue        # skip directories
    read cksm _ < <(md5sum "$file")     # first field of md5sum output is the checksum
    if ((arr[$cksm]++)); then           # non-zero means this content was seen before
        echo "rm $file"                 # print the rm command instead of deleting
    fi
done
https://superuser.com/questions/386199/how-to-remove-duplicated-files-in-a-directory
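Note that this variant identifies duplicates by md5 checksum (i.e. identical content) rather than by file name, and it only prints the rm commands instead of running them. One way to use it, assuming it is saved as dedup-md5.sh (a made-up name): run it from the archive directory, review the output, and only then change echo "rm $file" to rm -- "$file" to actually delete:

cd /path/to/ARCHIVE_LOC
bash dedup-md5.sh        # prints one "rm <path>" line per duplicate found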
On Debian 7, I managed to come up with the following one-liner:
find path/to/folder -type f -name '*.txt' -printf '%Ts\t%p\n' | sort -nr | cut -f2 | perl -ne '/(\w+\.txt)/; print if $seen{$&}++' | xargs rm
It is long and there may well be a shorter way, but it seems to do the trick. I combined the findings from here

https://superuser.com/questions/608887/how-can-i-make-find-find-files-in-reverse-chronological-order

and here
Perl regular expression removing duplicate consecutive substrings in a string
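Since the question notes that the newest file is also the biggest, the same pipeline could rank by size instead of mtime. A sketch of that variant (untested beyond the layout described above), with rm echoed first as a dry run:

find path/to/folder -type f -name '*.txt' -printf '%s\t%p\n' | sort -nr | cut -f2- | perl -ne '/([^\/]+\.txt)$/; print if $seen{$1}++' | xargs -r echo rm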