Patrick Useldinger wrote:
Brent Frère wrote:
Do a find. For each file, compute an md5sum. Sort the results and
detect the sets of files with matching md5sums. Then do a binary
compare of each pair of such files. If they match, you have found a duplicate!
I am going to write one, as I haven't found what I was looking for.
However, I don't see a reason why I should use md5sum: it means I have
to read each file entirely at least once to compute the hash, and
possibly a second time if the hashes match.
Why not compare the files directly (blockwise) when their lengths
match, and stop as soon as they differ?
-pu
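(If I follow you, the size-prefilter-plus-direct-compare route would look
roughly like the sketch below; GNU find is assumed, and the two paths in
the cmp line are only placeholders.)

# list every file with its size, so only same-size files ever need comparing
find . -type f -printf '%s\t%p\n' | sort -n > sizes.lst
# cmp reads both files blockwise and stops at the first differing byte;
# the two paths here are placeholders for a same-size couple from sizes.lst
cmp --silent "./dir1/some-file" "./dir2/some-file" && echo "identical"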
Maybe a (non-patentable) idea here: what about computing the md5sum on
only the first KB of each file?
Roughly speaking:
# find . -type f -exec sh -c 'printf "%s  %s\n" "$(head --bytes=1k "$1" | md5sum | cut -d" " -f1)" "$1"' _ {} \; | sort > md5sum.lst
# uniq --check-chars=32 --all-repeated md5sum.lst > md5sum.dup
# for each couple of files sharing a hash in md5sum.dup; do
cmp "$file1" "$file2"
done
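If it helps, that last pseudo-loop could be spelled out in bash roughly as
below (an untested sketch, assuming GNU coreutils plus cmp, the md5sum.dup
file produced above, and file names without embedded newlines):

#!/bin/bash
# Within each group of files sharing a truncated hash in md5sum.dup,
# run a full byte-by-byte compare on every pair and report real duplicates.
flush_group() {
    local i j
    for ((i = 0; i < ${#group[@]}; i++)); do
        for ((j = i + 1; j < ${#group[@]}; j++)); do
            cmp --silent -- "${group[i]}" "${group[j]}" \
                && printf 'duplicate: %s == %s\n' "${group[i]}" "${group[j]}"
        done
    done
    group=()
}

prev_hash=""
group=()
while read -r hash path; do
    # a new hash value closes the previous group of candidates
    [[ "$hash" != "$prev_hash" ]] && flush_group
    prev_hash="$hash"
    group+=("$path")
done < md5sum.dup
flush_group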
Tuning the 1k value is a compromise between first-pass file reading (the
find command) and the risk of unnecessary cmp commands in the "for" loop.
In usual situations the chance of two different files sharing the same
first 1 kB is low (but maybe not with some obscure M$ formats?), and
reading 1 kB should be fast (any read from a hard disk fetches several
512-byte blocks anyway). The exact value is irrelevant as long as the
files have unrelated contents. Otherwise, if the number of candidate
couples in the for loop is too large, you can consider re-running the
find with a larger 'head' size, or trying tail instead of head.
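For what it's worth, the sampled size could simply be made a variable, and
head swapped for tail; something like this (illustrative only, GNU
coreutils assumed) would count how many candidate files are left for a
given size:

BYTES=4k   # try 1k, 4k, 16k: larger means more first-pass reading, fewer useless cmp runs
find . -type f -exec sh -c \
  'printf "%s  %s\n" "$(tail --bytes='"$BYTES"' "$1" | md5sum | cut -d" " -f1)" "$1"' _ {} \; \
  | sort | uniq --check-chars=32 --all-repeated | wc -l   # candidate files remaining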
I leave this part of the job to you. If you do it, why not integrate it
into RSYNC, so that renaming folders no longer kills the tool's
performance?
--
Brent Frère
Private e-mail: Brent(a)BFrere.net
Postal address: 5, rue de Mamer
L-8280 Kehlen
Grand-Duchy of Luxembourg
European Union
Mobile: +352-021/29.05.98
Fax: +352-26.30.05.96
Home: +352-307.341
URL:
http://BFrere.net
If you have a problem with my digital signature, please install the
appropriate authority certificate by browsing
https://www.cacert.org/certs/root.crt.