Zsh Mailing List Archive

Re: find duplicate files



Several optimization points bear mentioning.  The performance impact of
each of them depends upon the nature of your file collection.  Taken
together, these optimizations can compound to do thousands of times less
work; in my experience it is rare for them to add no benefit at all.

Zeroth, you may want to be careful of zero-length files, which are all
identical to each other but also take up little to no space beyond their
pathname.  These may or may not be of interest in your specific use case,
but they are common enough to warrant mentioning.  Another zeroth-order
concern is file identity: inode/file numbers are a better identity key
here than path names, since several path names can be hard links to the
same file.  Mapping out hard links is better left to a separate utility.
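
As a minimal sketch of that identity idea (the helper name and structure
here are illustrative, separate from the program further down), grouping
paths by (device, inode) keeps hard links to the same file from being
reported as duplicates of one another:
--
import os

def group_by_identity(paths):
    # Group paths by (device, inode) so all hard links to one underlying
    # file land in a single group instead of looking like duplicate content.
    groups = {}
    for p in paths:
        st = os.lstat(p)                 # lstat: do not follow symlinks
        groups.setdefault((st.st_dev, st.st_ino), []).append(p)
    return groups
--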

First, and most importantly, the file size itself acts as a weak hash
that is already maintained by the OS/FS.  You do not need to hash or
compare files with different sizes, since a different size already tells
you the content is different.  For many file collections the size alone
is a strong enough hash, which makes the running time very similar to
that of "find".

Your files could all be the same size, in which case this does not help,
but it is a very cheap preprocessing step, often barely more expensive
than computing the set of files of interest in the first place.  It
results in a set of clusters of possible duplicates which can then be
tested more precisely.  On my own, admittedly peculiar, file sets I got
a speed-up of 10x..100x from this preprocessing step.
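
To make that concrete, here is a minimal sketch of the size-clustering
step on its own (names are illustrative; the full program below folds
this into its find() pass):
--
import os

def size_clusters(paths):
    # Map size -> paths; only clusters of 2+ same-sized files can contain
    # duplicates, so everything else is discarded without reading any data.
    by_size = {}
    for p in paths:
        by_size.setdefault(os.stat(p).st_size, []).append(p)
    return [ps for ps in by_size.values() if len(ps) > 1]
--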

Second, if you have a fast IO system (e.g., your data all fits in RAM)
then the time to strongly hash likely dominates the calculation, but
that also parallelizes easily.  In my tests, parallel speed-up was
better when hashing one flat worklist of every file in any maybe-
duplicated cluster than when hashing cluster by cluster, but my clusters
were mostly small compared to my CPU core count.

Larger clusters, or more need for as-you-go progress reporting, could
alter the desirability of that aspect of hashing parallelization.
The speed-up of this step depends on how many CPU cores you have and
how fast hashing is relative to IO.  For well-cached-in-RAM file sets
it can be on the order of a factor of 10.
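
A sketch of the flat-worklist approach, assuming a top-level hash_file
helper so the multiprocessing pool can pickle it (names illustrative):
--
from hashlib import sha1
from multiprocessing import Pool, cpu_count

def hash_file(path):                     # top-level so Pool can pickle it
    with open(path, "rb") as f:
        return path, sha1(f.read()).digest()

def digest_all(worklist, jobs=None):
    # Hash one flat worklist of all maybe-duplicate files across all cores,
    # rather than spinning up parallelism cluster by cluster.
    with Pool(jobs or cpu_count()) as pool:
        return dict(pool.imap_unordered(hash_file, worklist))
--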

Third, in theory even strong crypto hashes have a small but non-zero
chance of a collision, so you may need to deal with the possibility
of false positives anyway.  If you have a way to deal with a small false
positive rate then you may not need to strongly hash the WHOLE file.
If some well-defined subrange of the files (the first or last 64 KiB,
say) is effective at distinguishing files already known to be the same
size, then limiting hashing to that range saves a factor of average-file-
size/limit-size (in both IO and hashing time - so, a similar factor
whichever is your throughput bottleneck).
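
A curtailed digest is then just a bounded read before hashing; the
64 KiB default here is only an illustrative choice:
--
from hashlib import sha1

def partial_digest(path, limit=65536):
    # Hash only the first `limit` bytes; since candidates are already known
    # to share a size, prefix-plus-size is usually enough to tell them apart.
    with open(path, "rb") as f:
        return sha1(f.read(limit)).digest()
--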

There is some interaction between hash curtailing and parallelization,
in that very small limits can shift the bottleneck to non-hashing costs
like copying data and task switching, but for file sets with large
files these are mostly independent optimizations.  There can also be
interaction between spinning-rust disk seeking and parallelization.

Putting these all together, you can get a prototype program, in Python
rather than zsh, that looks like this (it should work with any modern
Python version).
--
from sys      import argv, stderr as err, stdout as out, exc_info
from os       import environ, listdir, stat, lstat ; E = environ.get
from hashlib  import *  # md5, sha1, sha256, sha512, blake2b, etc.
#This is the command-interface; Could be fancier
roots = argv[1:] if len(argv) > 1 else [ "." ]
DIGEST = eval(E("DIGEST", "sha1"))  #Can be any algo exported by hashlib
MIN    =  int(E("MIN"   , "1"))     #Exclude File Smaller Than This (Bytes)
MAX    =  int(E("MAX"   , "-1"))    #Limit IO/Digest work to first MAX Bytes
WASTED = E("WASTED", "")            #Set to non-Empty to get a wasted report
from multiprocessing import Pool, cpu_count  #$PAR controls max parallelism

from stat import S_ISDIR, S_ISREG
def findR(ino2st_paths, root, keep, follow):
    prefix = root + "/"             #Recursive body of find()
    for entry in listdir(root):
        if entry == "." or entry == "..": continue
        path = prefix + entry
        try: st = stat(path) if follow else lstat(path)
        except OSError:
            err.write("findR: %s: %s\n" % (root, exc_info()[1][-1]))
            continue
        if S_ISDIR(st.st_mode):     #Dir ==> Recurse
            findR(ino2st_paths, path, keep, follow)
        if keep(st.st_mode):
            try   : ino2st_paths[st.st_ino].append( path )
            except: ino2st_paths[st.st_ino] = [ st, path ]

def find(roots, keep=S_ISREG, follow=False):
    ino2st_paths = {}
    for root in roots: findR(ino2st_paths, root, keep, follow)
    return ino2st_paths

def Digest(path):
    try: return path, DIGEST(open(path, "rb").read(MAX)).digest()
    except:
        err.write("Cannot digest "+ path + "\n")
        return path, ""
## THE MAIN EVENT ##
path2size = {}                      #path -> size to avoid stat() calls
size2paths = {}                     #size -> paths of *first* hardlinks
for sp in find(roots).values():     #find() gives {ino: [stat, path1, ..]}
    if sp[0].st_size < MIN:
        continue
    for p in sp[1:]:
        path2size[p] = sp[0].st_size
    try   : size2paths[sp[0].st_size].append(sp[1]) #only append 1st hardlink
    except: size2paths[sp[0].st_size] = [ sp[1] ]   #which is sp[1]; see findR

size_coll = [ ]                     #Build worklist of files w/sz collides.
for paths in size2paths.values():   #Could Digest() here, but get better
    if len(paths) > 1:              #..parallel scale-up to do worklist all
        size_coll.extend(paths)     #..at once if most colls are "few-way".

digest = {}         #Build path -> digest
if MAX == 0:
    for path in size_coll: digest[path] = path2size[path]
else:               #Hashing can be slow; Do most work in a proc pool
    pool = Pool(int(E("PAR", cpu_count())))
    for (path, d) in pool.imap_unordered(Digest, size_coll):
        digest[path] = d + str(path2size[path]).encode() #tag digest w/size

paths = {}                          #Invert digest[path] to get collisions
for path in size_coll:
    try   : paths[ digest[path] ].append(path)  #Duplicate digest detected
    except: paths[ digest[path] ] = [ path ]    #First occurrence of digest

ans = []; sz = 0
for row in paths.values():          #Finally, print sorted answer. Also
    if len(row) > 1:                #..sort rows internally by mtime.
        ans.append("\t".join(sorted(row, key=lambda p: stat(p).st_mtime)))
        sz += (len(row) - 1) * (path2size[row[0]])
ans.sort()                          #sort alphabetically by first hardlink
out.write("\n".join(ans)); out.write("\n")
if WASTED: out.write("Wasted Space From Duplicates: %d bytes\n" % sz)
--

To re-connect topically with Zsh, you can study the impact of the file
size "weak hash prefix" idea by saving the above program as "dups" and
then doing:
    diff =(dups) =(MAX=0 dups)
or
    diff =(dups) =(MAX=65536 dups)
on a file collection, using Zsh's process substitution.
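
The other environment knobs (MIN, DIGEST, PAR, WASTED) combine in the
same way; for example (the paths here are just placeholders):
    WASTED=1 MIN=4096 DIGEST=blake2b PAR=8 dups ~/photos ~/backups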

Alternatively, some ambitious Python hater/Zsh lover could try to rewrite
the above with Zsh associative arrays, xargs -P, temp files, and the like.
I'm not sure I would recommend that.  A C rewrite would likely do better
at minimizing the interaction between hash curtailing and parallelization,
or at detecting when disk seeks cause thrashing and parallel IO should be
disabled.

