[tahoe-dev] find out how much your files converge with your friends'

Jeremy Fitzhardinge jeremy at goop.org
Tue Aug 19 12:01:05 PDT 2008


zooko wrote:
> Hi there, Jeremy:
>
> I'm adding Cc: the p2p-hackers list because I know some of the p2p  
> hackers would be interested in this experiment.
>
>
> On Aug 12, 2008, at 20:19, Jeremy Fitzhardinge wrote:
>
> (Brian Warner wrote:)
>   
>>> in some quick tests on allmydata customer data we found the space
>>> savings to be less than 1%. You might want to do some tests first
>>> (hash all your files, have your friends do the same, measure the
>>> overlap) before worrying about sharing convergence secrets.
>>>       
>
>   
>> Yes, that would be an interesting experiment to perform anyway.
>>     
>
> This prompted me to update my dupfilefind tool to v1.3.0 [1].  To  
> install it, run "easy_install dupfilefind".  (If you don't have  
> easy_install installed, follow these installation instructions: [2].)
>
> Run dupfilefind with the --profiles option and point it at some  
> directories.  It will run for a long time (overnight?) and eventually  
> print to stdout a list of the first 16 bits of the md5sum of each
> file and each file's size, rounded up to the nearest 4096 bits.  We can
> then compare our lists of 16-bit-md5s and filesizes to find out  
> approximately how much data we could save by convergent encryption  
> with one another.
>
> Also, dupfilefind is a handy tool for finding identical copies of  
> files on your system.  :-)
>
> The main difference between dupfilefind 1.3.0 and earlier versions is  
> that now it uses a temporary file instead of RAM for its working  
> state, which means it will now (eventually) finish no matter how many  
> files you point it at.  Earlier versions of dupfilefind would  
> sometimes use up all your RAM and then fail.
>   

OK, to get the ball rolling, I've put my Fedora 9 laptop's profile up on
http://www.goop.org/~jeremy/dupfiles.out.gz (~400k).

I generated this with:

$ dupfilefind -p -I /proc,/sys,/dev,/tmp,/var/tmp > /tmp/dupfiles.out
$ sort -u < /tmp/dupfiles.out > /tmp/dupfiles.uniq.out

to remove any internal duplicates, which aren't interesting when comparing
the potential savings between machines.
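
As a quick sanity check on a profile (just a sketch; it assumes each line
has the form "md5prefix:size", which is the format the comparison command
below splits on), you can count the entries and total the reported sizes:

$ awk -F: '{ files += 1; total += $2 }
           END { print files " files, " total " total size (as reported)" }' \
      /tmp/dupfiles.uniq.out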

You can compare N dupfile profiles to see how much space would be saved with:

$ cat dupfile1 dupfile2 ... | sort | uniq -c | awk 'BEGIN { RS="\n *"; FS="( +)|:" } { saved += ($1-1) * $3 } END { print "space saved: " saved }'
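
(The RS="\n *" bit is what swallows the leading spaces that uniq -c adds;
treating a multi-character RS as a regular expression is a gawk extension.)
In case that's unavailable, here's an equivalent, slightly more verbose
sketch of the same calculation, again assuming the "md5prefix:size" line
format:

$ cat dupfile1 dupfile2 ... | sort | uniq -c | sed 's/^ *//' \
    | awk -F '[ :]' '
        # after sed, each line is "<count> <md5prefix>:<size>";
        # a file that appears in k profiles saves (k-1) copies of its size
        { saved += ($1 - 1) * $3 }
        END { print "space saved: " saved }'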


Please post the results, and publish your dupfilefind profiles somewhere.

Thanks,
    J

