Zsh Mailing List Archive

Re: find duplicate files



On Sat, Apr 06, 2019 at 07:40:59AM +0200, Emanuel Berg wrote:
> Is this any good? Can it be done linearly?
> 
> TIA
> 
> #! /bin/zsh
> 
> find-duplicates () {
>     local -a files
>     [[ $# = 0 ]] && files=("${(@f)$(ls)}") || files=($@)
> 
>     local dups=0
> 
>     # files
>     local a
>     local b
> 
>     for a in $files; do
>         for b in $files; do
>             if [[ $a != $b ]]; then
>                 diff $a $b > /dev/null
>                 if [[ $? = 0 ]]; then
>                     echo $a and $b are the same
>                     dups=1
>                 fi
>             fi
>         done
>     done
>     [[ $dups = 0 ]] && echo "no duplicates"
> }
> alias dups=find-duplicates

Your function diffs every pair of files (each pair twice, in fact, and it 
keeps going after it finds duplicates), so its runtime is quadratic: 
roughly proportional to the square of the number of files (M) times the 
typical file size.  Here's one that calculates each file's MD5 checksum 
once and compares those instead, so it is O(N) + O(M log M), i.e., 
proportional to the sum of the file sizes (N) plus the cost of sorting 
the M checksums.

#!/bin/zsh
find-duplicates () {
    (( # > 0 )) || set -- *(.N)
    local dups=0
    md5sum "$@" | sort |
    uniq -c -w32 |        # -w32 (GNU uniq): group on the checksum, not the filename
    grep -cv '^  *1 ' |   # count checksums that occur more than once
    read dups             # (no -q on grep, or nothing reaches read)
    (( dups == 0 )) && echo "no duplicates"
}

A better solution would use an associative array (local -A NAME), would 
*not* sort, and would stop as soon as it found a duplicate, but I'll 
leave that as an exercise for the reader.  :-) 

Paul.

-- 
Paul Hoffman <nkuitse@xxxxxxxxxxx>
