Zsh Mailing List Archive

Re: find duplicate files



On Sat, Apr 06, 2019 at 07:40:59AM +0200, Emanuel Berg wrote:
> Is this any good? Can it be done linearly?
> 
> TIA
> 
> #! /bin/zsh
> 
> find-duplicates () {
>     local -a files
>     [[ $# = 0 ]] && files=("${(@f)$(ls)}") || files=($@)
> 
>     local dups=0
> 
>     # files
>     local a
>     local b
> 
>     for a in $files; do
>         for b in $files; do
>             if [[ $a != $b ]]; then
>                 diff $a $b > /dev/null
>                 if [[ $? = 0 ]]; then
>                     echo $a and $b are the same
>                     dups=1
>                 fi
>             fi
>         done
>     done
>     [[ $dups = 0 ]] && echo "no duplicates"
> }
> alias dups=find-duplicates

Your function diffs every pair of files (each pair twice, in fact, and it 
keeps going after it finds duplicates), so its runtime is quadratic: 
roughly proportional to the square of the number of files (M) times the 
typical file size.  Here's one that calculates each file's MD5 checksum 
once and compares those instead, so it is O(N) + O(M log M), i.e., 
proportional to the sum of the file sizes (N) plus the cost of sorting 
the M checksums.

#!/bin/zsh
find-duplicates () {
    (( # > 0 )) || set -- *(.N)
    local dups=0
    md5sum "$@" | sort |
    uniq -c -w32 |        # -w32 (GNU uniq): group on the checksum, not the filename
    grep -cv '^  *1 ' |   # count checksums that occur more than once
    read dups             # (no -q on grep, or nothing reaches read)
    (( dups == 0 )) && echo "no duplicates"
}

A better solution would use an associative array (local -A NAME), would 
*not* sort, and would stop as soon as it found a duplicate, but I'll 
leave that as an exercise for the reader.  :-) 

Paul.

-- 
Paul Hoffman <nkuitse@xxxxxxxxxxx>
