[tahoe-dev] find out how much your files converge with your friends'
Jeremy Fitzhardinge
jeremy at goop.org
Tue Aug 19 12:01:05 PDT 2008
zooko wrote:
> Hi there, Jeremy:
>
> I'm adding a Cc: to the p2p-hackers list because I know some of the
> p2p hackers would be interested in this experiment.
>
>
> On Aug 12, 2008, at 20:19, Jeremy Fitzhardinge wrote:
>
> (Brian Warner wrote:)
>
>>> in some quick tests on allmydata customer data we found the space
>>> savings to be less than 1%. You might want to do some tests first
>>> (hash all your files, have your friends do the same, measure the
>>> overlap) before worrying about sharing convergence secrets.
>>>
>
>
>> Yes, that would be an interesting experiment to perform anyway.
>>
>
> This prompted me to update my dupfilefind tool to v1.3.0 [1]. To
> install it, run "easy_install dupfilefind". (If you don't have
> easy_install installed, follow these installation instructions: [2].)
>
> Run dupfilefind with the --profiles option and point it at some
> directories. It will run for a long time (overnight?) and eventually
> print to stdout, for each file, the first 16 bits of its md5sum and
> its filesize rounded up to 4096 bits. We can then compare our lists
> of 16-bit md5s and filesizes to find out approximately how much data
> we could save by convergent encryption with one another.
>
> Also, dupfilefind is a handy tool for finding identical copies of
> files on your system. :-)
>
> The main difference between dupfilefind 1.3.0 and earlier versions is
> that it now uses a temporary file instead of RAM for its working
> state, which means it will (eventually) finish no matter how many
> files you point it at. Earlier versions would sometimes use up all
> your RAM and then fail.
>
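Incidentally, spilling the working state to disk is the standard trick
here: stream one record per file to a temp file, let sort(1) do an
external merge sort, then scan adjacent lines for duplicates. A sketch
of the general technique (not dupfilefind's actual implementation):

import os, subprocess, tempfile

def find_dupes(records):
    # records yields (key, path) pairs, e.g. key = 'md5prefix:size'.
    # Spool them all to a temp file so memory use stays flat no
    # matter how many files there are.
    tmp = tempfile.NamedTemporaryFile(mode='w', suffix='.dff', delete=False)
    for key, path in records:
        tmp.write('%s\t%s\n' % (key, path))
    tmp.close()
    # sort(1) sorts on disk, bringing duplicate keys together.
    proc = subprocess.Popen(['sort', tmp.name], stdout=subprocess.PIPE)
    prev, group = None, []
    for line in proc.stdout:
        key, path = line.decode().rstrip('\n').split('\t', 1)
        if key != prev:
            if len(group) > 1:
                yield group      # two or more paths share a key
            prev, group = key, []
        group.append(path)
    if len(group) > 1:
        yield group
    proc.wait()
    os.unlink(tmp.name)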
OK, to get the ball rolling, I've put my Fedora 9 laptop's profile up on
http://www.goop.org/~jeremy/dupfiles.out.gz (~400k).
I generated this with:
$ dupfilefind -p -I /proc,/sys,/dev,/tmp,/var/tmp > /tmp/dupfiles.out
$ sort -u < /tmp/dupfiles.out > /tmp/dupfiles.uniq.out
to remove any internal duplicates, which aren't interesting when
comparing the potential savings between machines.
You can compare N dupfiles to see how much saving there would be with:
$ cat dupfile1 dupfile2 ... | sort | uniq -c | awk '{ split($2, f, ":"); saved += ($1-1) * f[2] } END { print "space saved: " saved }'
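Or the same arithmetic in Python, if you'd rather not squint at awk
(same assumptions: one hash:size line per file, and every extra copy of
a line is a file that convergent encryption would store only once):

import fileinput

# Usage: python savings.py dupfile1 dupfile2 ...
counts = {}
for line in fileinput.input():
    line = line.strip()
    if line:
        counts[line] = counts.get(line, 0) + 1

saved = 0
for line, n in counts.items():
    size = int(line.split(':')[1])   # the size field of 'hash:size'
    saved += (n - 1) * size          # n copies converge to one stored copy
print("space saved: %d" % saved)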
Please post the results, and publish your dupfilefind profiles somewhere.
Thanks,
J