Do you think the cmp command reads the files up to the end, even if
they have unequal length or a difference at the beginning?
I am not talking about compare, but about md5. You MUST read it all to
calculate it, no choice.
The 'signature' of a file is for me just a checksum of a (well chosen)
part of it.
It does not guarantee that the files are equal, just that they are good
candidates. And that's all I'm looking for at the beginning...
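To make the "good candidates" idea concrete, here is a minimal sketch in Python. It assumes a cheap signature function is passed in; the names find_duplicates and signature_fn are hypothetical and chosen only for illustration. Files are first grouped by signature, and only files sharing a signature are then compared in full:

```python
import filecmp
from collections import defaultdict
from itertools import combinations


def find_duplicates(paths, signature_fn):
    """Return pairs of paths whose contents are byte-for-byte identical.

    signature_fn is a hypothetical callable mapping a path to a cheap
    signature; equal signatures only mean "good candidates".
    """
    # Step 1: bucket files by signature. Files in different buckets
    # cannot be equal, so they are never compared in full.
    buckets = defaultdict(list)
    for path in paths:
        buckets[signature_fn(path)].append(path)

    # Step 2: confirm each candidate pair with a full content comparison.
    duplicates = []
    for candidates in buckets.values():
        for a, b in combinations(candidates, 2):
            if filecmp.cmp(a, b, shallow=False):
                duplicates.append((a, b))
    return duplicates
```

The full comparison in step 2 is still needed, because a matching signature does not guarantee equal files.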
I thought about the first or last kB. Unfortunately, there exist files
having the same beginning. Think about zip files with roughly the
same content: the beginning is likely to contain the list of
contained files, so it is likely to be the same. Think about CD ISO
images: the first kB is likely to be the same... I have fewer examples
for the last kB, but CD or floppy images are highly likely to share
the same content there. So I think taking the kB closest to the middle of
the file is a good idea, and the entire file content if the file is
less than 1 kB in length. Let's call the md5sum of this block the
'signature' of the file.
Beginning, middle, end - statistically, they should be equal, no?
The md5sums of the first kB of a file, of the last block, and of the middle
block have no chance of being the same.
As I just explained, some files usually have the same beginning or the
same end, so in order to maximise the chance of getting a discriminating
checksum, let's compute it systematically on the 1 kB block nearest to
the middle of the file (aligned on 1024-byte boundaries).
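A possible sketch of this signature in Python follows. It is only an illustration: the function name middle_block_signature is hypothetical, and "nearest to the middle" is interpreted here as the 1024-byte-aligned block that contains the midpoint byte of the file, which is an assumption on my part.

```python
import hashlib
import os

BLOCK = 1024  # block size in bytes, as proposed above


def middle_block_signature(path):
    """Return the MD5 hex digest of the aligned 1 kB block at the middle.

    Files shorter than 1 kB are hashed in full.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        if size < BLOCK:
            # Short file: the signature covers the entire content.
            return hashlib.md5(f.read()).hexdigest()
        # Index of the 1024-byte-aligned block containing the midpoint.
        # For size >= 1024 this block is always a full 1 kB block.
        index = (size // 2) // BLOCK
        f.seek(index * BLOCK)
        return hashlib.md5(f.read(BLOCK)).hexdigest()
```

Only one seek and one 1 kB read are needed per file, whatever its size, which is the point of using a signature instead of a full md5sum.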
--
Brent Frère