Patrick Useldinger wrote:
Brent Frère wrote:
Great idea.
Why do you always have to be so sarcastic? And quick to shoot? Try to
be a little bit more constructive...
Sorry about that. Just my style. Not sarcastic at all. I even think we
are going in an interesting and constructive direction (big kiss :-) ). I even
need this for a customer project (the one described on page 20
of the February LuxBox issue).
If you had read my mail, you'd know that my
intention was to compare
only those files that have the same length.
I hadn't noticed that, indeed. However, my need is general, and I might
have several files of EXACTLY the same size (typically, several Oracle
datafiles of 2 GB), so if I put files having the same size into the same
set, it will be very inefficient. The trick of the first kB is a good
idea in this case.
If you want a more algorithmic description,
here's what I have in mind:
- build a list of all the files which have the same length and which
are larger than 1KB
I'm also interested in smaller files... I know you don't really care,
but think about a store of icons: hundreds of thousands of files
smaller than 1 kB. Somebody (stupidly) renames the folder from
'icon' to 'icons'. Without this tool, my rsync will spend the night
copying files into a newly created directory, even though they are already
there... So that's why your tool looks interesting to me, especially
integrated with rsync.
- for each group of files of the same length
- read the first block (1KB) of each file
- compare the blocks in memory one to another
Your hypothesis is that the number of files of the same size is
low... What about filesystems containing a huge number of same-size
files? For example: huge backups split into 700 MB slices, or a store
of floppy images (all 1.44 MB in size)?
What's right in your thinking is that we do have one very cheap piece of
information for each file: its size. It should indeed be used before
computing the md5sum.
You also assume that there is enough RAM to store 1 kB
per file... I have worked in embedded environments, and I think that in
this case, as well as on PDAs or cell phones, it should be avoided. That's
why I proposed to read just the first kB and keep only its md5sum in
memory. But you obviously dislike md5sum... :-)
- throw out those who are different to all the
others
- repeat until no file is left in the pool or end of files
- print the files which are left in the pool
So in the _worst_ case, that is if all files are equal, I read each
one entirely. That is the _best_ case in your approach.
Do you think the cmp command reads the files to the end, even if they
have unequal lengths or a difference at the beginning?
Unless I am missing something, of course.
...the two involved files will actually be compared.
You don't wish to
flag as identical files that merely share the same
md5sum and file length, I guess? Doing so would lead to an M$-like
system: something that works properly sometimes, behaves strangely
in unpredictable, unidentified circumstances, and sometimes even
behaves non-causally. Make your choice.
Stop this crap, please.
Ok. But I wouldn't be surprised to see such things in M$ code... :-)
I thought about the first or last kB. Unfortunately, there exist files
sharing the same beginning (think of zip files containing roughly the
same content: the beginning is likely to hold the list of contained
files, so it is likely to be identical; think of CD ISO images: the first
kB is likely to be the same...). I have fewer examples for the last kB,
but CD or floppy images are highly likely to share the same content
there too. So I think taking the kB closest to the middle of the file is
a good idea, and the entire file content if the file is shorter than 1 kB.
Let's call the md5sum of this block the 'signature' of the file.
So, my proposal is:
Make a list of file sizes. Drop the ones that have a unique length.
For each set of equal sizes,
If the set has cardinality 2, do an efficient cmp (one that stops
as soon as the files differ).
Otherwise, compute the list of signatures.
For each group of identical signatures, do a progressive multi-file
compare, as you described above.
If the comparison doesn't match any other file of the set, drop the
file.
If the comparison reaches the end of the files without a difference,
all the files of the set are identical.
If there are differences, but at least two files match in each
differing subset, keep comparing the first subset and add the second
to the nearest for loop. (may be improved)
roF
roF
Yours,
--
Brent Frère
Private e-mail: Brent(a)BFrere.net
Postal address: 5, rue de Mamer
L-8280 Kehlen
Grand-Duchy of Luxembourg
European Union
Mobile: +352-021/29.05.98
Fax: +352-26.30.05.96
Home: +352-307.341
URL:
http://BFrere.net
If you have problems with my digital signature, please install the appropriate authority
certificate by browsing
https://www.cacert.org/certs/root.crt.