Re: [Lilux-help] Duplicate files

22 Feb 2005

Brent Frère wrote:
...
  Sorry about that. Just my style. Not sarcastic at all. I even think we
 are going in an interesting and constructive direction (big kiss :-) ). I even
 need this for a customer project (the one that is described on page 20
 of the February LuxBox issue).
Good, you can give me your feedback then, I'll try to incorporate as
much as possible.
...
  I'm also interested in smaller files... I know you don't really care,
I didn't say that. The main thing is that the minimum length must be
larger than 0. I've made it a command-line option, so you choose.
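
A minimal sketch of what such an option could look like, assuming a tool that first groups candidate files by size. The option name --min-size and the helper names are hypothetical, not taken from the actual program; the only constraint carried over from the message is that the minimum length must be larger than 0.

```python
import argparse
import os
from collections import defaultdict


def positive_int(text):
    """argparse type: accept only integers strictly greater than 0."""
    value = int(text)
    if value <= 0:
        raise argparse.ArgumentTypeError("minimum length must be larger than 0")
    return value


def build_parser():
    parser = argparse.ArgumentParser(description="find candidate duplicate files")
    # --min-size is a hypothetical option name, not the real tool's flag.
    parser.add_argument("--min-size", type=positive_int, default=1)
    parser.add_argument("root")
    return parser


def group_by_size(root, min_size):
    """Map size -> list of paths, ignoring files shorter than min_size."""
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            size = os.path.getsize(path)
            if size >= min_size:
                groups[size].append(path)
    # Only sizes shared by at least two files can contain duplicates.
    return {size: paths for size, paths in groups.items() if len(paths) > 1}
```

With a minimum of 1, empty files are never reported as duplicates of each other; a larger value simply leaves the small files out.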
...
  You have as hypothesis that the number of files of the same size is
 low... What about filesystems containing a huge number of same-size files?
 Example: huge backups split into 700 MB slices, or storage of floppy
 images (all 1.44 MB in size)?
I've added a parameter to specify the maximum size to be allocated to
all the buffers, as well as the maximum size to be allocated to one
single buffer. That should do.
...
  You also take as hypothesis that you should have enough RAM to store 1 kB
 per file... I worked in embedded environments, and I think in this case,
 as well as in PDAs or cell phones, it should be avoided. That's why I
 proposed to just read the first kB and keep the md5sum in memory. But
 you obviously dislike md5sum... :-)
Everybody goes like "hey, use md5", but nobody can give me a valid
argument. The only one I can think of myself is why rsync uses it:
calculating md5 is easy on each side, and transferring the md5 over the
network to check if files might be identical can save you from transferring
the whole file. That makes sense, but my program is not network-based. I
have to read the whole of every file to calculate md5, whereas I have to
read at most the whole files to compare them.
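
To make the difference in I/O concrete, here is a rough sketch of both approaches. It is an illustration only, not the code of the program discussed here; the function names and the chunk size are assumptions.

```python
import hashlib
import os

CHUNK = 64 * 1024  # arbitrary read size chosen for this sketch


def same_content(path_a, path_b):
    """Compare two files chunk by chunk, stopping at the first difference."""
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False  # different lengths: no data needs to be read at all
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            block_a = fa.read(CHUNK)
            block_b = fb.read(CHUNK)
            if block_a != block_b:
                return False  # early exit, the rest is never read
            if not block_a:
                return True  # both files ended without a difference


def md5_of(path):
    """Hash a whole file; every byte has to be read before a digest exists."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(CHUNK), b""):
            digest.update(block)
    return digest.hexdigest()
```

The comparison reads both files to the end only when they really are identical, while md5_of always reads everything. On the other hand, a digest is computed once per file and can be compared against many others, which matters when a lot of files share the same size.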
...
  Do you think the cmp command reads the files up to the end, even if they
 have unequal length or a difference at the beginning ???
I am not talking about compare, but about md5. You MUST read it all to
calculate it, no choice.
...
  I thought about the first or last kB. Unfortunately, there exist files
 having the same beginning (think about zip files containing about the
 same content: the beginning is likely to contain the list of contained
 files, so is likely to be the same). Think about CD ISO images: the first
 kB is likely to be the same... I have fewer examples with the last kB, but CD
 or floppy images are highly likely to share the same content. So I
 think taking the kB closest to the middle of the file is a good idea,
 and the entire file content if the file is less than 1 kB in length.
 Let's call the md5sum of this block the 'signature' of the file.
Beginning, middle, end - statistically, they should be equal, no?
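
For anyone who wants to try the three positions on real data, the 'signature' described in the quote could be sketched roughly as follows. The function name and the where parameter are made up for this illustration; the 1 kB block size and the whole-file fallback for short files come from the quoted proposal.

```python
import hashlib
import os

BLOCK = 1024  # 1 kB, as in the quoted proposal


def signature(path, where="middle"):
    """md5 of one 1 kB block of the file, or of the whole file if it is shorter."""
    size = os.path.getsize(path)
    if size <= BLOCK or where == "start":
        offset = 0
    elif where == "end":
        offset = size - BLOCK
    else:
        offset = (size - BLOCK) // 2  # the block closest to the middle
    with open(path, "rb") as f:
        f.seek(offset)
        return hashlib.md5(f.read(BLOCK)).hexdigest()
```

Equal signatures only mean the files might be identical; a full comparison is still needed to confirm it.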
-pu
