Patrick Useldinger wrote:
Brent Frère wrote:
Do a find. For each file, compute an md5sum. Sort the result and detect
the sets of files having matching md5sums. Do a binary compare of each
couple of such files. If it matches, you found it!
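(In shell, that earlier suggestion boils down to something like this one-liner, assuming
GNU md5sum and uniq; it lists the candidate couples, each still needing a cmp, but note
that every file is hashed in full, which is precisely the cost questioned just below:)
# find . -type f -exec md5sum {} + | sort | uniq -w32 -D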
I am going to write one, as I haven't found what I was looking for.
However, I haven't found a reason why I should use md5sum. It means
that I have to read each file at least once entirely to compute the
hash, and possibly twice if the hashes match.
Why not compare them directly (blockwise) if their lengths match, and
stop as soon as they differ?
-pu
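(For what it is worth, here is a rough shell sketch of that length-then-compare idea. It is my
own wording, not Patrick's code; it assumes GNU find and bash, and for simplicity it only
compares adjacent couples of equal size:)
# find . -type f -printf '%s\t%p\n' | sort -n |
>   awk -F'\t' '$1 == size { print file "\t" $2 } { size = $1; file = $2 }' |
>   while IFS=$'\t' read -r f1 f2; do
>     # cmp reads blockwise and stops at the first differing byte
>     cmp -s "$f1" "$f2" && echo "identical: $f1 <-> $f2"
>   done
No file is hashed at all; only files sharing a size are ever opened.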
Maybe a (non-patentable) idea here: what about computing the md5sum on
the first kB of each file only?
Roughly speaking:
# find . -type f | while IFS= read -r f; do
>   printf '%s  %s\n' "$(head --bytes=1k "$f" | md5sum | cut -d' ' -f1)" "$f"
> done | sort > md5sum.lst
# uniq -w32 -D md5sum.lst > md5sum.dup      # keep lines whose 32-char hash repeats
# cut -c35- md5sum.dup | while IFS= read -r f1 && IFS= read -r f2; do
>   cmp "$f1" "$f2"                         # full compare; assumes candidates come in couples
> done
Tuning the 1k value is a question of compromise between first-pass file
reading (the find command) and the risk of having unnecessary cmp
commands in the while loop. The chance of files sharing their first 1 kB
is low in usual situations (but maybe with some obscure M$ formats?),
and reading 1 kB should be fast anyway (any read on a hard disk fetches
several 512-byte blocks). The exact value is irrelevant as long as the
files have unrelated contents. Otherwise, if the number of couples in
the loop is too large, you can consider re-running the find with a
larger 'head' size, or trying tail instead of head.
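(For instance, pulling the size and the head/tail choice out into variables makes re-running
the first pass a one-line change; HEADSIZE and FILTER are just illustrative names:)
# HEADSIZE=4k; FILTER=head    # or FILTER=tail, if the file starts are too similar
# find . -type f | while IFS= read -r f; do
>   printf '%s  %s\n' "$("$FILTER" --bytes="$HEADSIZE" "$f" | md5sum | cut -d' ' -f1)" "$f"
> done | sort > md5sum.lst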
I leave you this part of the job. If you do so, why not integrate it
into RSYNC, so that renaming folders no longer kills the tool's
performance?
--
Brent Frère
Private e-mail: Brent@BFrere.net
Postal address: 5, rue de Mamer
L-8280 Kehlen
Grand-Duchy of Luxembourg
European Union
Mobile: +352-021/29.05.98
Fax: +352-26.30.05.96
Home: +352-307.341
URL: http://BFrere.net
If you have problems with my digital signature, please install the appropriate authority certificate by browsing https://www.cacert.org/certs/root.crt.