Patrick Useldinger a écrit :
Brent Frère
wrote:
Do a find. For each file, compute a md5sum.
Do a sort of it. Detect the sets of files having matching md5sum. Do a
binary compare of each couple of such files. If it matches, you found
it !
I am going to write one, as I haven't found what I was looking for.
However, I haven't found a reason why I should use md5sum. It means
that I have to read each file at least once entirely to compute the
hash, and possibly twice if the hashes match.
Why not compare them directly (blockwise) if their length matches? And
stop as soon as they differ?
-pu
_______________________________________________
Lilux-help mailing list
Lilux-help@lilux.lu
http://lilux.lu/mailman/listinfo/lilux-help
Great idea.
You have 100.000 files in an average filesystem. If you compare each
couple, this gives you
100.000 * (100.000-1) / 2 combinations, so about 5.000.000.000 file
comparaisons, or 10.000.000.000 file read. As example, the first file
only will have to be compared do 99.999 others...
Instead, the md5sum stuff reads all the files (indeed) but only once,
leading to 100.000 file read. About 100.000 times more performant that
your proposal.
The actual file comparison is only a confirmation that the files are
indeed exactly the same, and not only sharing the same md5sum by chance
(very unlikely), so will be done more than probably on files that HAVE
the same content. No waste of time then, because it is the first time
the two involved files will be actually compared. You don't wish to
flag as identical files the ones that are just sharing the same md5sum
and file length, I guess ? Doing so would lead to a M$t-like system:
something that works properly sometimes, and has strange behaviour in
some unpredictable, unidentified circumstances, and even sometimes a
non causal behaviour. Do your choice.
For very small "filesystems" (~10 files), your algorithm might be more
efficient, depending on the file content (your comparaison might indeed
not read the file up to the end if properly implemented), but computer
science learns us that an algorithm having a lower computational
complexity is always the good choice in long-term.
Yours,
--
Brent Frère
Private e-mail: Brent@BFrere.net
Postal address: 5, rue de Mamer
L-8280 Kehlen
Grand-Duchy of Luxembourg
European Union
Mobile: +352-021/29.05.98
Fax: +352-26.30.05.96
Home: +352-307.341
URL: http://BFrere.net
If you have problem with my digital signature, please install the appropriate authority certificate by browsing https://www.cacert.org/certs/root.crt.