[tahoe-dev] find out how much your files converge with your friends'

zooko zooko at zooko.com
Mon Aug 18 15:58:43 PDT 2008


Hi there, Jeremy:

I'm adding Cc: the p2p-hackers list because I know some of the p2p  
hackers would be interested in this experiment.


On Aug 12, 2008, at 20:19 PM, Jeremy Fitzhardinge wrote:

(Brian Warner wrote:)
>> in some quick tests on allmydata customer data we found the space
>> savings to be less than 1%. You might want to do some tests first
>> (hash all your files, have your friends do the same, measure the
>> overlap) before worrying about sharing convergence secrets.

> Yes, that would be an interesting experiment to perform anyway.

This prompted me to update my dupfilefind tool to v1.3.0 [1].  To  
install it, run "easy_install dupfilefind".  (If you don't have  
easy_install installed, follow these installation instructions: [2].)

Run dupfilefind with the --profiles option and point it at some  
directories.  It will run for a long time (overnight?) and eventually  
print to stdout a list of the first 16 bits of the md5sum of each  
file and the filesize of each file, rounded up to 4096 bits.  We can  
then compare our lists of 16-bit-md5s and filesizes to find out  
approximately how much data we could save by convergent encryption  
with one another.
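
To make the comparison concrete, here is a rough Python sketch of the idea. It is not dupfilefind's actual code or output format. The names file_profile and estimated_savings are made up for illustration, and it assumes sizes are counted in bytes and rounded up to a multiple of 4096.

```python
import hashlib
import os
from collections import Counter

BLOCK = 4096  # rounding granularity (assumed unit: bytes)


def file_profile(path):
    """Return (first 16 bits of the MD5 as 4 hex digits, rounded-up size)."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large files do not have to fit in RAM.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5.update(chunk)
    size = os.path.getsize(path)
    rounded = -(-size // BLOCK) * BLOCK  # ceiling to a multiple of BLOCK
    return md5.hexdigest()[:4], rounded


def estimated_savings(mine, theirs):
    """Sum the rounded sizes of profile entries that occur in both lists."""
    # Counter intersection keeps each (prefix, size) pair as many times
    # as it appears in both lists, so duplicates are not overcounted.
    shared = Counter(mine) & Counter(theirs)
    return sum(size * count for (_prefix, size), count in shared.items())
```

Since only 16 bits of the hash are kept, two unrelated files of the same rounded size will sometimes match by chance, so a number computed this way is only an approximation and tends to overstate the real overlap.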

Also, dupfilefind is a handy tool for finding identical copies of  
files on your system.  :-)

The main difference between dupfilefind 1.3.0 and earlier versions is  
that now it uses a temporary file instead of RAM for its working  
state, which means it will now (eventually) finish no matter how many  
files you point it at.  Earlier versions of dupfilefind would  
sometimes use up all your RAM and then fail.

Regards,

Zooko

[1] http://allmydata.org/trac/dupfilefind
[2] http://pypi.python.org/pypi/setuptools/0.6c8#installation-instructions

