Zsh Mailing List Archive

Re: find duplicate files



On Sun, Apr 7, 2019 at 4:19 AM Charles Blake <charlechaud@xxxxxxxxx> wrote:
>
>
> Zeroeth, you may want to be careful of zero length files which are all
> identical, but also take up little to no space beyond their pathname.

This is just **/*(.L+0) in zsh (the dot restricts the match to plain
files, so directories are ignored; L+0 keeps only files larger than
zero bytes).
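
If you'd rather report the empty files as one trivial group of
duplicates (they're all identical by definition), that's a one-liner
too; a small sketch, where the N qualifier avoids an error when there
are no empty files at all:

empty=( **/*(.L0N) )
(( $#empty > 1 )) && print Duplicate: $empty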

> Another zeroth order concern is
> file identity.  I-node/file numbers are better here than path names.

You can get all these numbers reasonably fast (unless some sort of
networked storage is involved) with

zmodload zsh/stat
zstat -A stats **/*(.L+0)

Of course if you're talking really huge numbers of files you might not
want to collect this all at once.
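
In that case you can go one file at a time; zstat -H fills an
associative array with named fields for a single file, which also
makes it easy to see what's available.  A quick sketch ("somefile" is
just a stand-in for a real path):

zmodload zsh/stat
# Keys are the stat field names, e.g. device, inode, size
zstat -H s somefile
print inode: $s[inode] size: $s[size]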

> First, and most importantly, the file size itself acts as a weak hash

That's also snarfed up by zstat -A.  It doesn't really take any
longer to get the file size from stat than it does the inode, so
unless you're NOT going to count hard-linked files as duplicates,
you might as well just compare sizes.  (It would be faster to get
inodes from readdir, but there's no shell-level readdir.)  More on
this later.
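
If you do want to skip hard links rather than report them as
duplicates, you could pre-filter by device and inode; a rough sketch
(the names "seen" and "files" are just for illustration):

zmodload zsh/stat
typeset -A seen
files=()
for f in **/*(.L+0)
do
  zstat -H s $f
  key=$s[device]:$s[inode]
  # Paths sharing a device:inode pair are the same file; keep the first
  if (( ! ${+seen[$key]} ))
  then
    seen[$key]=$f
    files+=( $f )
  fi
done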

> Second, if you have a fast IO system (eg., your data all fits in RAM)
> then time to strongly hash likely dominates the calculation, but that
> also parallelizes easily.

You're probably not going to beat built-in lightweight threading and
shared memory with shell process-level "threading", so if you have a
large number of files that are the same size but with different
contents then something like python or C might be the way to go.
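
For what it's worth, shell process-level parallelism would look
something like the following sketch with zargs (assuming a separate
hashing program such as GNU md5sum is available); every invocation is
still a separate fork+exec, which is exactly why built-in threads win:

autoload -Uz zargs
# Up to 8 md5sum processes at a time, 16 files per process
zargs -P 8 -n 16 -- **/*(.L+0) -- md5sum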

> Third, in theory, even strong crypto hashes have a small but non-zero
> chance of a collision.  So, you may need to deal with the possibility
> of false positives anyway.

This should be vanishingly small if the sizes are the same?  False
positives can be checked with "cmp -s", which just compares the
files byte-by-byte until it finds a difference, but this is a
pretty hefty penalty on the true positives.
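
If you want that extra check anyway, verifying a reported pair after
the fact is simple enough; a sketch, where file1 and file2 stand in
for the two paths in question:

if cmp -s $file1 $file2
then print Confirmed duplicate: $file1 $file2
else print "False positive (checksum collision):" $file1 $file2
fi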

Anyway, here's zsh code; I haven't dealt with files having strange
characters in their names that might prevent them being easily used
as hash keys; that's left as an exercise for somebody who has files
with strange names:

zmodload zsh/stat
zstat -nA stats **/*(.L+0)
# Every stat struct has 15 elements, so we pick out every 15th
names=( ${(e):-'${stats['{1..${#stats}..15}']}'} ) # name is element 1
sizes=( ${(e):-'${stats['{9..${#stats}..15}']}'} ) # size is element 9
# Zip the two arrays to make a mapping
typeset -A clusters sizemap=( ${names:^sizes} )
# Compute clusters of same-sized files
for i in {1..$#sizes}
do
  same=( ${(k)sizemap[(R)$sizes[i]]} )
  (( $#same > 0 )) || continue
  # Delete entries we've seen so we don't find them again
  unset 'sizemap['${^same}']'
  (( $#same > 1 )) && clusters[$sizes[i]]=${(@qq)same}
done
# Calculate checksums by cluster and report duplicates
typeset -A sums
for f in ${(v)clusters}
do
  # Inner loop could be put in a function and outer loop replaced by zargs -P
  for sum blocks file in $( eval cksum $f )
  do
    if (( ${+sums[$sum.$blocks]} ))
    then print Duplicate: ${sums[$sum.$blocks]} $file
    else sums[$sum.$blocks]=$file
    fi
  done
done

I find that a LOT more understandable than the python code.
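
To expand on the comment in the inner loop, here's a rough, untested
sketch of the zargs -P version; the function name and the -P 4 value
are arbitrary, and output from parallel jobs may interleave:

checksum_cluster() {
  local sum blocks file
  local -A sums
  for sum blocks file in $( eval cksum $@ )
  do
    if (( ${+sums[$sum.$blocks]} ))
    then print Duplicate: ${sums[$sum.$blocks]} $file
    else sums[$sum.$blocks]=$file
    fi
  done
}
autoload -Uz zargs
# One cluster (one string of quoted file names) per call, 4 at a time
zargs -P 4 -n 1 -- ${(v)clusters} -- checksum_cluster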


