[tahoe-dev] measure your convergence

zooko zooko at zooko.com
Thu Mar 20 13:38:34 PDT 2008


Folks:

Ever wondered how much storage space you would save if you and your  
friends coalesced all of your identical files?

Wonder no longer!  Now you can find out!  Install the "dupfilefind"  
utility [*] and run it with command-line arguments like:

dupfilefind --ignore-dirs="," --min-size=32 --profiles

(It probably works on all operating systems.)

It will recursively examine all files reachable from the current  
working directory and spew out a series of "hashcode filesize" pairs,  
where the hashcode is the least significant 8 bits of the adler32  
checksum of the first 8192 bytes of the file.
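As a rough illustration of the pair described above, here is a minimal Python sketch. It is not dupfilefind's actual code: the function names `profile` and `walk_profiles` are made up for this example, and skipping symlinks is an assumption on my part.

```python
import os
import zlib

PREFIX_LEN = 8192  # number of leading bytes that feed the checksum


def profile(path):
    """Return (hashcode, filesize) for one file (hypothetical helper)."""
    with open(path, "rb") as f:
        head = f.read(PREFIX_LEN)
    # Keep only the least significant 8 bits of the adler32 checksum.
    hashcode = zlib.adler32(head) & 0xFF
    return hashcode, os.path.getsize(path)


def walk_profiles(top="."):
    """Yield a profile for every regular file reachable from top."""
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Assumption: symlinks are skipped; the real tool may differ.
            if os.path.isfile(path) and not os.path.islink(path):
                yield profile(path)
```

Since only 8 bits of the checksum survive, many unrelated files of the same size will share a pair, which is why the results can only be approximate.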

It will also report any two separate files on your system that are  
identical to each other.

Send the output to your friends, or to me -- zooko at zooko.com -- and  
we'll find out approximately how many of your files are shared with  
other people who submit results.  (Please compress your output with a  
good compressor like 7zip or rzip or bzip2.)
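For the curious, the comparison could be sketched roughly as follows. This assumes each profile line is just two whitespace-separated decimal integers, "hashcode filesize"; the real output format may differ, and `parse_profiles` and `estimate_shared` are hypothetical names invented for this example.

```python
from collections import Counter


def parse_profiles(text):
    """Parse 'hashcode filesize' lines into a multiset of pairs.

    Assumed format: two whitespace-separated decimal integers per line.
    Anything else (for example a duplicate-file notice) is skipped.
    """
    pairs = Counter()
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue
        try:
            pairs[(int(fields[0]), int(fields[1]))] += 1
        except ValueError:
            continue
    return pairs


def estimate_shared(mine, theirs):
    """Count pairs that appear in both multisets."""
    # Counter & Counter keeps the minimum count of each common pair.
    return sum((mine & theirs).values())
```

The count is an upper bound on the number of shared files, because two different files can have the same size and the same 8-bit hashcode.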

You take full responsibility for leaking all this information about  
your files -- namely the low 8 bits of the adler32 checksum of each  
file's first 8192 bytes, and each file's size.  Also, in case  
duplicate files are detected on your system, their device number and  
inode number.

Regards,

Zooko

[*] To install the dupfilefind utility, either download this tarball:

http://pypi.python.org/packages/source/d/dupfilefind/dupfilefind-1.1.2.tar.gz#md5=af8de6f3ead053e326389a9a87b0a11d

untar it, cd into the resulting directory, and run:

python setup.py install

or else install the easy_install tool:

http://peak.telecommunity.com/DevCenter/EasyInstall#installing-easy-install

and then run:

easy_install dupfilefind



