Zsh Mailing List Archive

Re: find duplicate files



>I find that a LOT more understandable than the python code.

Understandability is, of course, somewhat subjective (e.g. some might say
every 15th field is unclear relative to a named label), but the Python also
does more, and it would be quite a bit simpler without any parallelism, just
as the Zsh code would be more complex with a setup for it.  So it's a
somewhat apples & oranges/strawman comparison.  No offense intended.  I
don't dislike your solution or a simpler Python one.  Simplicity is good.
As elaborated below, everything sort of "just depends".

Some hash functions are really fast...some IO systems are really slow, etc.
To really do the parallel IO/hashing right (being nice to competition both
from oneself & other programs on the system) you probably want to start
with a parallelism of 1 CPU and dynamically adjust based on CPU utilization
(say, after each file, via getrusage in some parallel thread coordinator).
Such a scheme would be more complex still but, as we both mentioned, would
be more efficient in a lower-level language.
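
To make that concrete, here is a minimal sketch (not the posted script) of
load-adaptive parallel hashing in Python.  It substitutes os.getloadavg()
for the getrusage bookkeeping mentioned above, and the worker/coordinator
structure, sha256 choice, and 64 KiB read size are illustrative
assumptions:

    import hashlib, os, queue, threading

    MAX_WORKERS = os.cpu_count() or 1

    def hash_one(path, bufsize=1 << 16):
        """Hash one whole file in bufsize chunks."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            while chunk := f.read(bufsize):
                h.update(chunk)
        return h.hexdigest()

    def worker(jobs, results, gate):
        while True:
            path = jobs.get()
            if path is None:              # sentinel: shut down
                break
            with gate:                    # run only when a slot is allowed
                try:
                    results[path] = hash_one(path)
                except OSError:
                    pass                  # unreadable file: skip it

    def coordinator(gate, stop, interval=1.0):
        """Grow/shrink allowed concurrency between 1 and MAX_WORKERS
        based on the 1-minute load average (stand-in for getrusage)."""
        allowed = 1
        while not stop.wait(interval):
            load = os.getloadavg()[0]
            if load < MAX_WORKERS - 1 and allowed < MAX_WORKERS:
                gate.release()            # grant one more concurrent slot
                allowed += 1
            elif load > MAX_WORKERS and allowed > 1:
                gate.acquire()            # claw one slot back
                allowed -= 1

    def parallel_hash(paths):
        jobs, results = queue.Queue(), {}
        gate = threading.BoundedSemaphore(MAX_WORKERS)
        for _ in range(MAX_WORKERS - 1):  # begin with concurrency of 1
            gate.acquire()
        stop = threading.Event()
        threads = []
        for _ in range(MAX_WORKERS):
            t = threading.Thread(target=worker, args=(jobs, results, gate))
            t.start()
            threads.append(t)
        threading.Thread(target=coordinator, args=(gate, stop),
                         daemon=True).start()
        for p in paths:
            jobs.put(p)
        for _ in threads:
            jobs.put(None)                # one sentinel per worker
        for t in threads:
            t.join()
        stop.set()
        return results

Threads do overlap here because hashlib releases the GIL while hashing
buffers larger than a couple of KiB; a lower-level language would still
avoid the remaining interpreter overhead.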


>This should be vanishingly small if the sizes are the same?

Of course, if you have a hash with no known collision attack, or no reason
to suspect a filesystem collision attack, then yes, a vanishingly small
probability.  I did not mean to overemphasize the risk so much as use it
to introduce the "truncated IO/hash" optimization idea, trading off some
false-positive risk against performance.  I personally used the report for
rare-case manual intervention, not automated response.  So, cmp at the
end makes total sense.  I don't think there is any disagreement here.
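
For that final cmp step, something like the following sketch would do.
filecmp.cmp(..., shallow=False) compares byte-for-byte, and the
candidate_groups input (lists of same-hash paths from the earlier stage)
is an assumption, not part of the posted script:

    import filecmp

    def confirm(candidate_groups):
        """Split each same-hash group into groups of byte-identical files."""
        confirmed = []
        for group in candidate_groups:
            remaining = list(group)
            while len(remaining) > 1:
                ref, rest = remaining[0], remaining[1:]
                same = [p for p in rest
                        if filecmp.cmp(ref, p, shallow=False)]
                if same:
                    confirmed.append([ref] + same)
                remaining = [p for p in rest if p not in same]
        return confirmed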

Another motivation for the truncated IO/hash optimization that may be
underappreciated is that if your report is empty then you are done.
False negatives don't happen.  That might well be the common case after
some situation is cleaned up and you want a "quick check" that there are
no duplicates, or if you otherwise "suspect but do not know" that there are
no duplicates for some file set.  If that quick check fails then you can do
a slower, more precise job.  With large files or slow IO this can help a lot.
Call it an 'is it possible there's a duplicate' operation if you will.
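
A minimal sketch of such a quick check, assuming a directory walk plus a
prefix hash (the 64 KiB prefix length and helper names are illustrative,
not from the posted script):

    import hashlib, os
    from collections import defaultdict

    def prefix_digest(path, nbytes=1 << 16):
        """Hash only the first nbytes of the file."""
        with open(path, "rb") as f:
            return hashlib.sha256(f.read(nbytes)).hexdigest()

    def possible_duplicates(root):
        by_size = defaultdict(list)       # size is the first weak hash
        for dirpath, _, names in os.walk(root):
            for name in names:
                p = os.path.join(dirpath, name)
                if os.path.isfile(p) and not os.path.islink(p):
                    by_size[os.path.getsize(p)].append(p)
        candidates = []
        for paths in by_size.values():
            if len(paths) < 2:
                continue                  # unique size: cannot be a dup
            by_prefix = defaultdict(list)
            for p in paths:
                by_prefix[prefix_digest(p)].append(p)
            candidates += [g for g in by_prefix.values() if len(g) > 1]
        return candidates                 # [] means: definitely no dups

An empty return value proves there are no duplicates; a non-empty one only
lists candidates that still need the slower, more precise pass.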

Partial hashes can also be a jumping-off point for detecting "corrupt"
or near-duplicate files.  The many ways corruption can happen generate
a list of questions/approaches which is off topic.  One way it can happen
is non-size-changing scribbling over (or non-population of) just a few
bytes, which does relate to partial hashing, though obviously partial
hashing is not perfectly tuned for such an application.


>I haven't dealt with the files having strange characters in the names
>that might prevent them being easily used as hash keys, that's left as
>an exercise

Part of my purpose in writing the Python was indeed being able to deal
with weird path names, which I also generally try hard to avoid, but
sometimes that's beyond one's control/authority.  I almost added a user
parameter for the hard TAB and NEWLINE chars used as output delimiters,
but tried to keep it simple for exposition.  With UTF-8 these days...truly
general filenames are decidedly a PITA.
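
For what it's worth, a NUL-delimited report sidesteps the delimiter problem
entirely, since NUL is the one byte a POSIX path cannot contain.  A small
sketch (the groups input is assumed to be the duplicate groups computed
earlier):

    import os, sys

    def report(groups, sep=b"\0"):
        """Emit each path NUL-terminated; an empty record ends a group."""
        out = sys.stdout.buffer
        for group in groups:
            for path in group:
                out.write(os.fsencode(path) + sep)   # TAB/NL/UTF-8 proof
            out.write(sep)
        out.flush()

The consuming side can then stay safe with xargs -0 or similar.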


>unless you're NOT going to consider linked files as duplicates you
>might as well just compare sizes.  (It would be faster to get inodes

What may have been underappreciated is that handling hard-link identity
also lets you skip recomputing hashes over a hard-link cluster, which must
all be the same.  (Or re-doing the IO if the time between visits to that
i-node is long enough to lose buffer caching.)  One conceivable, possibly
even automatic, resolution of duplicates is conversion into hard links.
If a lot of duplication were resolved this way then this would save time
on a "second pass" over such file hierarchies.  I'm just defending that
admittedly likely minor optimization (which just wasn't so hard to add in
Python).
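
A sketch of that shortcut, grouping paths by (st_dev, st_ino) so each
i-node is read and hashed only once; full_digest stands in for whatever
whole-file hash is already in use:

    import os
    from collections import defaultdict

    def digests_with_hardlink_sharing(paths, full_digest):
        clusters = defaultdict(list)      # (dev, inode) -> names
        for p in paths:
            st = os.stat(p)
            clusters[(st.st_dev, st.st_ino)].append(p)
        result = {}
        for cluster in clusters.values():
            d = full_digest(cluster[0])   # one read+hash per i-node
            for p in cluster:             # ...shared by every hard link
                result[p] = d
        return result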

Most claims in this whole space depend upon application context in some
way.  Some problems are hard, like files being updated during the duplicate
scan.  Yeah...you can look at time stamps, and maybe re-compute things, but
ick.  If things are changing, who knows what version the user wants
assessed?  Duplicates are probably rare.  Handling large trees of heavily
hard-linked files is probably a much smaller optimization than using file
size as a weak hash to organize the calculation.  You can probably find a
hash function faster than your IO.  Or maybe you have files on some quad
M.2 NVMe RAID setup delivering 13 gigabytes/sec and it is harder to match
that with a crypto hash function.  Almost everything you say needs a
"probably/maybe" qualifier.  I don't think you disagree.  I'm just
elaborating a little for passers-by.
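
If anyone wants to measure rather than guess, a rough sketch like this
compares hash throughput on an in-memory buffer with read throughput on a
large file of your choosing (the 128 MiB buffer and the placeholder path
are illustrative assumptions):

    import hashlib, os, time

    def hash_throughput(nbytes=128 * 1024 * 1024, algo="sha256"):
        buf = os.urandom(nbytes)          # hash from RAM: no IO in the way
        t0 = time.perf_counter()
        hashlib.new(algo, buf).digest()
        return nbytes / (time.perf_counter() - t0) / 1e9   # GB/s

    def read_throughput(path, bufsize=1 << 20):
        total, t0 = 0, time.perf_counter()
        with open(path, "rb") as f:
            while chunk := f.read(bufsize):
                total += len(chunk)
        return total / (time.perf_counter() - t0) / 1e9    # GB/s

    if __name__ == "__main__":
        print(f"sha256 : {hash_throughput():.2f} GB/s")
        # print(f"read   : {read_throughput('/some/big/file'):.2f} GB/s")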

P.S.:
    Two of my side-comments got auto-wrapped at 76 text columns.  Should
be an easy fix for anyone cutting & pasting that Python code.  Sorry!

