Patrick Useldinger wrote:

Brent Frère wrote:

Do a find. For each file, compute an md5sum. Sort the results. Detect the sets of files having matching md5sums. Do a binary compare of each pair of such files. If they match, you found it!
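The first steps of the recipe above (find, hash, group by hash) could be sketched in Python roughly as follows. The names `md5_of_file` and `candidate_groups` and the 64 KiB block size are illustrative assumptions, not part of the original suggestion, and a dictionary stands in for the sort step. The groups it returns are only candidates; the binary compare is still needed afterwards.

```python
import hashlib
import os
from collections import defaultdict


def md5_of_file(path, block_size=65536):
    """Return the hex MD5 digest of a file, read in fixed-size blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"" (EOF).
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def candidate_groups(root):
    """Walk root and group regular files by MD5 digest; keep groups of 2+."""
    by_digest = defaultdict(list)
    for directory, _subdirs, names in os.walk(root):
        for name in names:
            path = os.path.join(directory, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.isfile(path) and not os.path.islink(path):
                by_digest[md5_of_file(path)].append(path)
    return [sorted(paths) for paths in by_digest.values() if len(paths) > 1]
```

Calling `candidate_groups("/some/dir")` would return a list of lists of paths, one list per digest shared by at least two files.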

I am going to write one, as I haven't found what I was looking for.

However, I haven't found a reason why I should use md5sum. It means that I have to read each file at least once entirely to compute the hash, and possibly twice if the hashes match.
Why not compare them directly (blockwise) if their length matches? And stop as soon as they differ?
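A minimal sketch of that direct comparison in Python, assuming a hypothetical helper named `same_content` and an arbitrary 64 KiB block size: it checks the lengths first, then reads both files block by block and returns as soon as two blocks differ.

```python
import os


def same_content(path_a, path_b, block_size=65536):
    """Compare two files block by block, stopping at the first difference."""
    # Files of different length cannot be identical, so nothing is read.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as file_a, open(path_b, "rb") as file_b:
        while True:
            block_a = file_a.read(block_size)
            block_b = file_b.read(block_size)
            if block_a != block_b:
                return False  # early exit: the rest is never read
            if not block_a:
                return True  # both files reached EOF with no difference
```

Two files that differ in their first block cost only one read each, which is the saving the question is about.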

-pu
_______________________________________________
Lilux-help mailing list
Lilux-help@lilux.lu
http://lilux.lu/mailman/listinfo/lilux-help

Great idea.

You have 100,000 files in an average filesystem. If you compare each pair, this gives you
100,000 * (100,000 - 1) / 2 combinations, so about 5,000,000,000 file comparisons, or 10,000,000,000 file reads. For example, the first file alone will have to be compared to 99,999 others...
Instead, the md5sum approach does read all the files, but only once, leading to 100,000 file reads. That is about 100,000 times more efficient than your proposal.
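The arithmetic behind these figures can be checked with a few lines of Python. This is only a sketch of the worst case assumed here, where every file is compared against every other one and each comparison reads both files; the variable names are illustrative.

```python
def pair_count(n):
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


files = 100_000
comparisons = pair_count(files)        # every file against every other
reads_pairwise = 2 * comparisons       # each comparison reads two files
reads_hashing = files                  # md5sum reads each file once
ratio = reads_pairwise / reads_hashing

print(comparisons)     # 4999950000
print(reads_pairwise)  # 9999900000
print(ratio)           # 99999.0
```

The exact values are 4,999,950,000 comparisons and 9,999,900,000 reads, which round to the 5,000,000,000 and 10,000,000,000 quoted, and the ratio of 99,999 rounds to the factor of 100,000.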
The actual file comparison only confirms that the files are indeed exactly the same, and do not merely share the same md5sum by chance (very unlikely), so it will almost always be done on files that DO have the same content. No time is wasted then, because it is the first time the two files involved are actually compared. I guess you don't want to flag as identical files that merely share the same md5sum and file length? Doing so would lead to an M$t-like system: something that works properly sometimes, behaves strangely in some unpredictable, unidentified circumstances, and sometimes even behaves non-causally. Make your choice.
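The confirmation step could look roughly like this in Python, assuming the caller already holds a list of paths that share the same md5sum. The helper name `confirm_identical` is hypothetical; `filecmp.cmp` with `shallow=False` is the standard-library call that compares actual file contents rather than only the `os.stat()` signatures.

```python
import filecmp


def confirm_identical(paths):
    """Split paths sharing a digest into sets of byte-for-byte identical files."""
    confirmed = []
    for path in paths:
        for group in confirmed:
            # Compare against one representative of each confirmed group.
            if filecmp.cmp(group[0], path, shallow=False):
                group.append(path)
                break
        else:
            # No existing group matched: this file starts a new one.
            confirmed.append([path])
    # A group of one means the file has no confirmed duplicate.
    return [group for group in confirmed if len(group) > 1]
```

In the expected case, where all files in the list really are identical, each file is compared once against the first one, so the confirmation adds roughly one extra read per duplicate.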

For very small "filesystems" (~10 files), your algorithm might be more efficient, depending on the file content (your comparison might indeed not read the files to the end if properly implemented), but computer science teaches us that an algorithm with a lower computational complexity is always the right choice in the long term.

Yours,

-- 
Brent Frère

Private e-mail:  Brent@BFrere.net

Postal address: 5, rue de Mamer
                L-8280 Kehlen
                Grand-Duchy of Luxembourg
                European Union

Mobile: +352-021/29.05.98
Fax:    +352-26.30.05.96
Home:   +352-307.341
URL:    http://BFrere.net

If you have a problem with my digital signature, please install the appropriate authority certificate by browsing https://www.cacert.org/certs/root.crt.