Brent Frère wrote:
Sorry about that. Just my style. Not sarcastic at all.
I even think we are going in an interesting and constructive direction
(big kiss :-) ). I even need this for a customer project (the one that
is described on page 20 of the February LuxBox issue).
Good, you can give me your feedback then, I'll try to incorporate as
much as possible.
I'm also interested in smaller files... I know you don't really care,
I didn't say that. The main thing is that the minimum length must be
larger than 0. I've made it a command-line option, so you choose.
You assume that the number of files of the same size is low... What
about filesystems containing a huge number of same-size files? Example:
huge backups split into 700 MB slices, or storage of floppy images (all
1.44 MB in size)?
I've added a parameter to specify the maximum size to be allocated to
all the buffers, as well as the maximum size to be allocated to one
single buffer, that should do.
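
A rough sketch of how those options could fit together, assuming the
program first groups files by size and then splits a fixed memory budget
across each group. The names group_by_size, buffer_size, min_size,
max_total and max_single are made up for illustration and are not the
program's actual options:

```python
import os
from collections import defaultdict


def group_by_size(paths, min_size=1):
    """Map size -> paths, keeping only sizes shared by two or more files.

    min_size stands in for the (hypothetical) minimum-length option;
    with the default of 1, empty files are skipped.
    """
    groups = defaultdict(list)
    for path in paths:
        size = os.path.getsize(path)
        if size >= min_size:
            groups[size].append(path)
    return {size: names for size, names in groups.items() if len(names) > 1}


def buffer_size(n_files, max_total, max_single):
    """Bytes each file in a same-size group may hold in memory at once.

    Assumed policy: share max_total evenly, never exceed max_single,
    and never drop below one byte.
    """
    return max(1, min(max_single, max_total // n_files))
```

With 700 same-size files, a 1 MB total budget and a 64 kB per-buffer cap,
each file would get 1497 bytes of buffer under this policy, so even a
directory full of floppy images stays within the limit.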
You also assume that you have enough RAM to store 1 kB per file... I
have worked in embedded environments, and I think that there, as well
as on PDAs or cell phones, this should be avoided. That's why I
proposed to just read the first kB and keep only its md5sum in memory.
But you obviously dislike md5sum... :-)
Everybody goes like "hey, use md5", but nobody can give me a valid
argument. The only one I can think of myself is why rsync uses it:
calculating md5 is easy on each side, and transferring the md5 over the
network to check whether files might be identical can save you from
transferring the whole file. That makes sense, but my program is not
network-based. To calculate md5 I have to read the whole of each file,
whereas to compare I have to read at most the whole of each file.
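
To make the difference concrete, here is a minimal sketch (not the
program's actual code; same_content, md5_of and CHUNK are illustrative
names) that counts how many bytes each approach reads:

```python
import hashlib

CHUNK = 64 * 1024  # illustrative read size, not taken from the program


def same_content(path_a, path_b, chunk=CHUNK):
    """Compare two files chunk by chunk, stopping at the first difference.

    Returns (identical, bytes_read_from_first_file).
    """
    read = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk)
            b = fb.read(chunk)
            read += len(a)
            if a != b:
                return False, read
            if not a:  # both files ended at the same point
                return True, read


def md5_of(path, chunk=CHUNK):
    """Hash a file with md5; every byte has to be read.

    Returns (hexdigest, bytes_read).
    """
    digest = hashlib.md5()
    read = 0
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            digest.update(block)
            read += len(block)
    return digest.hexdigest(), read
```

For two 1 MB files that differ in the first byte, same_content stops
after one 64 kB chunk, while md5_of has to read the full 1 MB of each
file before the digests can be compared.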
Do you think the cmp command reads the files up to the end, even if
they have unequal length or a difference at the beginning ???
I am not talking about compare, but about md5. You MUST read it all to
calculate it, no choice.
I thought about the first or last kB. Unfortunately, there are files
that have the same beginning (think about zip files with roughly the
same content: the beginning is likely to contain the list of contained
files, so it is likely to be the same). Think about CD ISO images: the
first kB is likely to be the same... I have fewer examples for the last
kB, but CD or floppy images are highly likely to share the same content
there. So I think taking the kB closest to the middle of the file is a
good idea, and the entire file content if the file is less than 1 kB in
length.
Let's call the md5sum of this block the 'signature' of the file.
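
A minimal sketch of that 'signature', assuming "the kB closest to the
middle" means a 1024-byte block centred on the file's midpoint (the
names signature and BLOCK are illustrative only):

```python
import hashlib
import os

BLOCK = 1024  # the "1 kB" from the proposal above


def signature(path, block=BLOCK):
    """md5 of the block centred on the middle of the file.

    Files shorter than `block` are hashed in full, since the offset
    is clamped to 0 and the read then returns the entire content.
    """
    size = os.path.getsize(path)
    offset = max(0, (size - block) // 2)
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read(block)
    return hashlib.md5(data).hexdigest()
```

Only the 32-character hex digest (16 bytes in raw form) has to be kept
per file, instead of the 1 kB block itself. Two files with different
signatures cannot be identical; files with equal signatures still need
a full comparison.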
Beginning, middle, end - statistically, they should be equal, no?
-pu