[tahoe-dev] find out how much your files converge with your friends'
zooko
zooko at zooko.com
Mon Aug 18 15:58:43 PDT 2008
Hi there, Jeremy:
I'm adding Cc: the p2p-hackers list because I know some of the p2p
hackers would be interested in this experiment.
On Aug 12, 2008, at 20:19 PM, Jeremy Fitzhardinge wrote:
(Brian Warner wrote:)
>> in some quick tests on allmydata customer data we found the space
>> savings to be less than 1%. You might want to do some tests first
>> (hash all your files, have your friends do the same, measure the
>> overlap) before worrying about sharing convergence secrets.
> Yes, that would be an interesting experiment to perform anyway.
This prompted me to update my dupfilefind tool to v1.3.0 [1]. To
install it, run "easy_install dupfilefind". (If you don't have
easy_install installed, follow these installation instructions: [2].)
Run dupfilefind with the --profiles option and point it at some
directories. It will run for a long time (overnight?) and eventually
print to stdout a list of the first 16 bits of the md5sum of each
file and the filesize of each file, rounded up to 4096 bits. We can
then compare our lists of 16-bit-md5s and filesizes to find out
approximately how much data we could save by convergent encryption
with one another.
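
The comparison described above can be sketched in a few lines of Python.
This is not dupfilefind's own code and it does not reproduce the output
format of --profiles; the names file_profile, directory_profiles and
estimated_overlap are hypothetical, and the rounding unit is taken to be
4096 as stated above. Because only 16 bits of each hash are compared,
unrelated files of similar size can match by chance, so the estimate
tends to run high.

```python
import hashlib
import os
from collections import Counter

BLOCK = 4096  # rounding unit for file sizes (assumption, from the text above)


def file_profile(path):
    """Return (first 16 bits of the MD5 as 4 hex digits, size rounded up to BLOCK)."""
    md5 = hashlib.md5()
    size = 0
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large files never have to fit in RAM.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5.update(chunk)
            size += len(chunk)
    rounded = -(-size // BLOCK) * BLOCK  # ceiling division, then scale back up
    return md5.hexdigest()[:4], rounded


def directory_profiles(root):
    """Count how many regular files under root have each profile."""
    profiles = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                profiles[file_profile(path)] += 1
    return profiles


def estimated_overlap(mine, theirs):
    """Return (bytes in profiles both sides hold, total rounded bytes in mine)."""
    shared = set(mine) & set(theirs)
    # Each distinct shared profile is counted once: one stored copy saved.
    shared_bytes = sum(size for _prefix, size in shared)
    my_bytes = sum(size * count for (_prefix, size), count in mine.items())
    return shared_bytes, my_bytes
```

Each participant would run directory_profiles on their own directories
and exchange only the resulting (prefix, size) pairs, which is the point
of truncating the hash: the pairs reveal very little about the files
themselves.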
Also, dupfilefind is a handy tool for finding identical copies of
files on your system. :-)
The main difference between dupfilefind 1.3.0 and earlier versions is
that now it uses a temporary file instead of RAM for its working
state, which means it will now (eventually) finish no matter how many
files you point it at. Earlier versions of dupfilefind would
sometimes use up all your RAM and then fail.
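
For illustration, here is a minimal sketch of keeping the working state
on disk rather than in RAM. How dupfilefind actually lays out its
temporary file is not described here, so this substitutes an SQLite
database in a temporary directory; find_duplicates and md5_of are
hypothetical names, not part of dupfilefind.

```python
import hashlib
import os
import sqlite3
import tempfile


def md5_of(path):
    """Hash a file in 1 MiB chunks and return the hex digest."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5.update(chunk)
    return md5.hexdigest()


def find_duplicates(roots):
    """Yield sorted lists of paths that share both size and MD5."""
    with tempfile.TemporaryDirectory() as tmpdir:
        # The per-file records live in this on-disk database, not in a dict.
        db = sqlite3.connect(os.path.join(tmpdir, "state.db"))
        try:
            db.execute("CREATE TABLE files (size INTEGER, md5 TEXT, path BLOB)")
            for root in roots:
                for dirpath, _dirnames, filenames in os.walk(root):
                    for name in filenames:
                        path = os.path.join(dirpath, name)
                        if os.path.islink(path) or not os.path.isfile(path):
                            continue
                        # Paths are stored as bytes so odd filenames survive.
                        db.execute(
                            "INSERT INTO files VALUES (?, ?, ?)",
                            (os.path.getsize(path), md5_of(path), os.fsencode(path)),
                        )
            db.commit()
            # Only the keys of groups with more than one member are fetched.
            groups = db.execute(
                "SELECT size, md5 FROM files "
                "GROUP BY size, md5 HAVING COUNT(*) > 1 ORDER BY size, md5"
            ).fetchall()
            for size, digest in groups:
                rows = db.execute(
                    "SELECT path FROM files WHERE size = ? AND md5 = ? ORDER BY path",
                    (size, digest),
                ).fetchall()
                yield [os.fsdecode(row[0]) for row in rows]
        finally:
            db.close()
```

Memory use then depends on the number of duplicate groups rather than on
the total number of files scanned.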
Regards,
Zooko
[1] http://allmydata.org/trac/dupfilefind
[2] http://pypi.python.org/pypi/setuptools/0.6c8#installation-instructions